Preface



After two successful meetings in Vorau (2001) and Vancouver (2002), the Interna- 
tional Conference on Robust Statistics 2003, ICORS 2003, took place in Antwerp, 
Belgium, from July 13-18. The conference was intended to be a forum where all 
aspects of robust statistics could be discussed. As such the scientific program of- 
fered a wide range of talks on new developments and practice of robust statistics 
with applications in finance, chemistry, engineering and others. Of equal interest 
were interactions between robustness and other fields of statistics, and science in 
general. 

The conference was attended by 134 participants from 28 countries around
the world. The program contained 78 oral and 14 poster presentations.
This volume presents a wide range of papers that were presented at the confer- 
ence. As most of the contributions contain both theoretical and empirical results, 
we prefer to present them in alphabetical order of the first author. A rough classifi- 
cation could be made as follows: new methods and theoretical results are presented 
by J.G. Adrover and V.J. Yohai, R.B. Dimova and N.M. Neykov, P.J. Hingley,
A. Kankainen et al., A. Kharin, A. Marazzi and V.J. Yohai, L. Masicek, D.J. Olive,
M.R. Oliveira et al., J.F. Ortega, C. Ortelli and F. Trojani, D.W. Scott, S.S. Syed
Yahaya et al., S. Taskinen et al., A. Toma, and B. Vandewalle et al. Empirical
properties of robust methods are studied in A. Christmann, P. Cizek, A. Durio
and E.D. Isaia, S. Engelen et al., K. Joossens and C. Croux, and J. Jureckova
and J. Picek. Computational aspects are treated in C. Chen, S. Copt and
M.-P. Victoria-Feser, S. Morgenthaler et al., D.M. Mount et al., and G. Zioutas.
Applications in finance, engineering and computer vision are emphasized in
L. Aucremanne et al., G. Bassett et al., E. Ollila and V. Koivunen, D. Suter and
H. Wang, L. Thomassen and M. Van Wouwe, and S. Vanlanduit and P. Guillaume.
A survey of the connection between computational geometry and statistical depth
is presented in E. Rafalin and D.L. Souvaine.

We are very grateful to all those who were involved in the organization of
the conference and this proceedings volume. Our sincere thanks go to the local
organizers from the University of Antwerp: Guy Brys, Ellen Vandervieren,
Katrien Van Driessen, Martine Van Wouwe, Sabine Verboven and Gert Willems. We
highly appreciate their enthusiasm and accuracy. We thank Peter Rousseeuw for 
his indispensable help and financial support. Michiel Debruyne, Sanne Engelen 
and Karlien Vanden Branden helped us with the type-setting of this volume. Our 
thanks go to the members of the scientific committee and many anonymous referees
who carefully reviewed the papers in this volume. Their names are listed behind 
the table of contents. We also express our thanks to the University of Antwerp, the 
Minerva Foundation at Princeton - New Jersey, the Fund for Scientific Research 
Flanders, the Belgian Statistical Society, KBC Banking and Insurance and SAS 
Institute for their generous financial support. Finally, we thank all participants for 
making ICORS 2003 a successful meeting.

We hope that this book will provide a state-of-the-art overview of robust 
statistics, and that it will encourage both researchers and practitioners to study 
and to apply robust statistical methods in the future. 



Mia Hubert 
Greet Pison 
Anja Struyf 
Stefan Van Aelst 
Bias Behavior of the Minimum Volume
Ellipsoid Estimate 

J.G. Adrover and V.J. Yohai 



Abstract. Rousseeuw introduced the Minimum Volume Ellipsoid (MVE) esti- 
mates of covariance matrix and multivariate location. These estimates, which 
are broadly used, are affine equivariant and have high breakdown point. Croux 
et al. (2002) derived the maximum bias curve for point mass contaminations 
of a non equivariant version of the MVE. In this paper we obtain a similar 
result for the equivariant version of the MVE estimates. 

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62H12. 
Keywords. Minimum volume ellipsoid, maximum bias, robust estimates, mul- 
tivariate location. 



1. Introduction 

The first equivariant estimate of multivariate location and covariance matrix with 
high breakdown point for all p was proposed independently by Stahel (1981) 
and Donoho (1982). Rousseeuw (1985) proposed the Minimum Volume Ellipsoid 
(MVE) estimates of multivariate location and scatter. These broadly used robust 
estimates are defined as the center and scatter of the ellipsoid of minimum vol- 
ume covering half of the data. These estimates are affine equivariant and have 
asymptotic breakdown points equal to 0.5. 

Let $Z = \{z_1, \dots, z_n\}$ be a dataset in $R^p$. For a vector $t$ and a matrix $V \in \mathcal{V}_p$,
where $\mathcal{V}_p$ denotes the set of positive definite $p \times p$ matrices, the squared
Mahalanobis distances are defined as

$$d_i^2 = d_i^2(t, V) = (z_i - t)' V^{-1} (z_i - t). \qquad (1.1)$$
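As an illustration of (1.1), the following minimal sketch evaluates the squared Mahalanobis distances for bivariate data. The function name `squared_mahalanobis`, the restriction to $p = 2$ (which allows a closed-form inverse) and the sample points are assumptions made for the example only; they are not part of the paper.

```python
def squared_mahalanobis(points, t, V):
    """Return (z - t)' V^{-1} (z - t) for each 2-D point z.

    V is a symmetric positive definite 2x2 matrix given as nested lists.
    """
    (a, b), (c, d) = V
    det = a * d - b * c
    if det <= 0 or a <= 0:
        raise ValueError("V must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    distances = []
    for z in points:
        u0, u1 = z[0] - t[0], z[1] - t[1]
        distances.append(
            u0 * (inv[0][0] * u0 + inv[0][1] * u1)
            + u1 * (inv[1][0] * u0 + inv[1][1] * u1)
        )
    return distances


# Hypothetical data: with V = I the result is the squared Euclidean norm.
data = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(squared_mahalanobis(data, (0.0, 0.0), identity))
```

With the identity matrix the quantities reduce to squared Euclidean distances from $t$, which gives a quick sanity check of the formula.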

Actually, the MVE belongs to a broader family of estimates based on a scale
estimate, which are defined as follows. Consider a scale estimate $s_n : R^n \to R_{\ge 0}$
($R_{\ge 0}$ is the set of nonnegative real numbers) satisfying $s_n(\lambda x) = |\lambda| s_n(x)$. Then
the estimates $(\hat{T}, \hat{V}^*)$ are defined by

$$(\hat{T}, \hat{V}^*) = \arg\min_{t \in R^p,\ V \in \mathcal{V}_p,\ \det(V) = 1} s_n(d_1(t, V), \dots, d_n(t, V)). \qquad (1.2)$$

Partially supported by grant PICT-99 03-06277 from the Agencia Nacional de Promoción de
la Ciencia y la Tecnología, grant X611 from the Universidad de Buenos Aires, Argentina, and a
grant from Fundación Antorchas, Argentina.

The vector $\hat{T}$ is the estimate of location and the matrix $\hat{V}^*$ gives the shape of
the covariance matrix. To account also for the size, the estimate of the covariance
matrix is defined by

$$\hat{V} = c_p\, s_n^2(d_1(\hat{T}, \hat{V}^*), \dots, d_n(\hat{T}, \hat{V}^*))\, \hat{V}^*,$$

where $c_p$ is a constant. Generally $c_p$ is chosen to make the estimate consistent for
the multivariate normal family. If $s_n(x_1, \dots, x_n) = ((x_1^2 + \dots + x_n^2)/n)^{1/2}$, then
$\hat{T}$ and $\hat{V}$ are respectively the sample mean and the sample covariance matrix. If
$s_n(x_1, \dots, x_n) = \mathrm{median}\{|x_1|, \dots, |x_n|\}$, we have the MVE. Davies (1987) studies
the case where $s_n$ is an M-estimate of scale.

If $z$ is a random vector, we denote by $Q(g(z), F, \alpha)$ the $\alpha$-quantile of $g(z)$
when $z$ has distribution $F$. Then the location and scatter shape MVE estimating
functionals $T(F)$, $V^*(F)$ are defined by

$$(T(F), V^*(F)) = \arg\min_{t \in R^p,\ V \in \mathcal{V}_p,\ \det(V) = 1} Q((z - t)' V^{-1} (z - t), F, 0.5). \qquad (1.3)$$

When more than one value attains this minimum, $(T(F), V^*(F))$
is taken to be the set of all these values. The objective function $Q(\cdot, F, 0.5)$
evaluated at the solutions given in (1.3) is the appropriate factor to stretch the
shape matrix and yield an ellipsoid covering half of the data.
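For intuition, the sample analogue of (1.3) can be computed exactly in the univariate case, where the constraint $\det(V) = 1$ forces $V = 1$ and the problem reduces to finding the shortest interval that covers about half of the observations. The sketch below is only an illustration under stated assumptions: the coverage count $h = \lfloor n/2 \rfloor + 1$ is one common convention, and the function name `univariate_mve` and the data are hypothetical.

```python
def univariate_mve(sample):
    """Midpoint and half-length of the shortest interval covering h points."""
    x = sorted(sample)
    n = len(x)
    h = n // 2 + 1  # assumed coverage convention for "half of the data"
    # Slide a window of h consecutive order statistics and keep the shortest.
    start = min(range(n - h + 1), key=lambda i: x[i + h - 1] - x[i])
    lo, hi = x[start], x[start + h - 1]
    return (lo + hi) / 2.0, (hi - lo) / 2.0


# Hypothetical sample with two gross outliers (10 and 11).
sample = [0.0, 1.0, 1.5, 2.0, 2.5, 10.0, 11.0]
center, radius = univariate_mve(sample)
print(center, radius, sum(sample) / len(sample))
```

In this example the interval midpoint stays within the bulk of the data, whereas the sample mean is pulled towards the two outlying values.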

A good measure of the robustness of an estimate is the maximum bias 
(maxbias) curve, which gives the maximum asymptotic bias of the estimate caused 
by a given fraction of contamination. Yohai and Maronna (1990) obtained the 
maxbias curve for the MVE covariance estimate in case the location is known. 
Croux et al. (2002) considered an orthogonal equivariant version of the MVE, the 
Minimum Volume Ball (MVB) which is defined as 

$$T(F) = \arg\min_{t \in R^p} Q((z - t)'(z - t), F, 0.5). \qquad (1.4)$$

The authors calculated the maxbias curve of the MVB for point mass contamina- 
tions at the normal central model assuming that the covariance matrix is known. 
Croux and Haesbroeck (2002) derived the maxbias curve of the MVE location 
estimator with unknown scale parameter in the univariate setting under all types 
of contamination. 

In this paper we derive the point mass maxbias curves for the MVE simul- 
taneous estimates of location and covariance matrix shape assuming an elliptical 
unimodal and continuous density. These maxbias curves are tabulated for the case 
of the multivariate normal model. 
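2. Maxbias Curve for the MVE

Let $F_0$ be a distribution on $R^p$ and $0 < \varepsilon < 1$, then the point mass contamination
neighborhood of size $\varepsilon$ is defined by

$$\mathcal{N}_{F_0, \varepsilon} = \{F : F = (1 - \varepsilon) F_0 + \varepsilon \delta_z,\ z \in R^p\},$$

where $\delta_z$ denotes the point mass distribution at $z$.

Consider an elliptical family of distributions $F_{\mu, \Sigma}$ on $R^p$, where $\mu \in R^p$ and
$\Sigma \in \mathcal{V}_p$. Then $F_{\mu, \Sigma}$ has density of the form

$$f(z, \mu, \Sigma) = \det(\Sigma)^{-1/2}\, g((z - \mu)' \Sigma^{-1} (z - \mu)). \qquad (2.1)$$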

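Let $(T(F), V^*(F))$ be affine equivariant and Fisher consistent estimating
functionals of location and of the shape of the covariance matrix respectively.
Then the maxbias curves of $T(F)$ and $V^*(F)$ at $F_{\mu, \Sigma}$ for point
mass contaminations of size $\varepsilon$ are defined by

$$B(T, F_{\mu, \Sigma}, \varepsilon) = \sup\{((T(F) - \mu)' \Sigma^{-1} (T(F) - \mu))^{1/2} : F \in \mathcal{N}_{F_{\mu, \Sigma}, \varepsilon}\}$$

and

$$B(V^*, F_{\mu, \Sigma}, \varepsilon) = \sup_{F \in \mathcal{N}_{F_{\mu, \Sigma}, \varepsilon}} \frac{\lambda_p(\Sigma^{-1/2} V^*(F) \Sigma^{-1/2})}{\lambda_1(\Sigma^{-1/2} V^*(F) \Sigma^{-1/2})},$$

respectively, where given a positive definite $p \times p$ matrix $A$, $\lambda_1(A) \le \lambda_2(A) \le \dots \le \lambda_p(A)$
are its eigenvalues.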

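Due to the affine equivariance of $T$ and $V^*$, it is immediate that the maxbias
curves of $T$ and $V^*$ do not depend on $\mu$ and $\Sigma$, and therefore we can write

$$B(T, F_{0,I}, \varepsilon) = \sup\{\|T(F)\| : F \in \mathcal{N}_{F_{0,I}, \varepsilon}\}$$

and

$$B(V^*, F_{0,I}, \varepsilon) = \sup\{\lambda_p(V^*(F)) / \lambda_1(V^*(F)) : F \in \mathcal{N}_{F_{0,I}, \varepsilon}\}.$$

According to this definition, when $(T, V^*)$ is the MVE estimating functional,
we have

$$B(T, F_{0,I}, \varepsilon) = \sup\{\|t_0\| : (t_0, V_0, z_0) \in M\}$$

and

$$B(V^*, F_{0,I}, \varepsilon) = \sup\{\lambda_p(V_0) / \lambda_1(V_0) : (t_0, V_0, z_0) \in M\},$$

where

$$M = \Big\{(t_0, V_0, z_0) : (t_0, V_0) = \arg\min_{t,\ V \in \mathcal{V}_p,\ \det(V) = 1} Q((z - t)' V^{-1} (z - t), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{z_0}, 0.5)\Big\}.$$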

Because of the equivariance of the MVE, it is easy to show that without loss of
generality we can restrict these minima to points $z_0$ of the form $(z_0, 0, \dots, 0)$. Then
it is possible to prove that we can also restrict $t$ to points of the form $(\mu, 0, \dots, 0)$
and $V^*$ to matrices of the form $\mathrm{diag}(v, v^{-1/(p-1)}, \dots, v^{-1/(p-1)})$ where $v > 0$ (see
Adrover and Yohai, 2003). In this case we can write $(z - t)' V^{-1} (z - t) = h(z, v, \mu)$,
where

$$h(z, v, \mu) = \frac{(z_1 - \mu)^2}{v} + v^{1/(p-1)} \sum_{i=2}^{p} z_i^2.$$
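The reduction to the function $h$ can be checked numerically. The sketch below assumes the form $h(z, v, \mu) = (z_1 - \mu)^2 / v + v^{1/(p-1)} \sum_{i \ge 2} z_i^2$, which is what the quadratic form gives for $t = (\mu, 0, \dots, 0)$ and $V = \mathrm{diag}(v, v^{-1/(p-1)}, \dots, v^{-1/(p-1)})$; the function names and the test point are hypothetical.

```python
def h(z, v, mu):
    """(z_1 - mu)^2 / v + v^(1/(p-1)) * sum_{i>=2} z_i^2, for p >= 2."""
    p = len(z)
    return (z[0] - mu) ** 2 / v + v ** (1.0 / (p - 1)) * sum(zi ** 2 for zi in z[1:])


def quadratic_form(z, v, mu):
    """(z - t)' V^{-1} (z - t) with t = (mu, 0, ..., 0) and the diagonal V above."""
    p = len(z)
    diag_v = [v] + [v ** (-1.0 / (p - 1))] * (p - 1)  # determinant equals 1
    t = [mu] + [0.0] * (p - 1)
    return sum((zi - ti) ** 2 / vi for zi, ti, vi in zip(z, t, diag_v))


# Hypothetical point in dimension p = 3.
z, v, mu = (1.5, -0.5, 2.0), 4.0, 0.5
print(h(z, v, mu), quadratic_form(z, v, mu))
```

Both expressions agree, and the diagonal entries multiply to one, so the shape matrix satisfies the constraint $\det(V) = 1$.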

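Then we can also write

$$B(T, F_{0,I}, \varepsilon) = \max\{\mu_0 : (\mu_0, v_0, z_0) \in M\}$$

and

$$B(V^*, F_{0,I}, \varepsilon) = \max\{v_0^{p/(p-1)} : (\mu_0, v_0, z_0) \in M\},$$

where

$$M = \Big\{(\mu_0, v_0, z_0) : \mu_0 \ge 0,\ v_0 > 0,\ z_0 \ge 0,\ (\mu_0, v_0) = \arg\min_{\mu, v} Q(h(z, v, \mu), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(z_0, 0, \dots, 0)}, 0.5)\Big\}.$$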



From now on $\varepsilon$ is fixed, so we do not make explicit
the dependence on $\varepsilon$ of some of the functions we define.

Put $\varepsilon_1 = (0.5 - \varepsilon)/(1 - \varepsilon)$, $\varepsilon_2 = 0.5/(1 - \varepsilon)$ and $c_i = Q(\|z\|^2, F_{0,I}, \varepsilon_i)$, $i = 1, 2$.
Let

$$m(v, \mu) = Q(h(z, v, \mu), F_{0,I}, \varepsilon_1).$$
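To make the constants concrete, the sketch below evaluates $\varepsilon_1$, $\varepsilon_2$, $c_1$ and $c_2$ for the standard normal model in dimensions $p = 1$ and $p = 2$, where the quantiles of $\|z\|^2$ have simple closed forms. It is an illustration only: the function names are hypothetical, other dimensions would need a general chi-square quantile routine, and the restriction $\varepsilon < 0.5$ is an assumption added so that $\varepsilon_1 > 0$.

```python
import math
from statistics import NormalDist


def contamination_levels(eps):
    """Return (eps_1, eps_2) = ((0.5 - eps)/(1 - eps), 0.5/(1 - eps))."""
    if not 0.0 < eps < 0.5:
        raise ValueError("this sketch assumes 0 < eps < 0.5")
    return (0.5 - eps) / (1.0 - eps), 0.5 / (1.0 - eps)


def c_values_normal(eps, p):
    """Quantiles (c_1, c_2) of ||z||^2 under N(0, I_p), for p = 1 or p = 2."""
    levels = contamination_levels(eps)
    if p == 1:
        # z^2 <= c iff |z| <= sqrt(c), so sqrt(c) is a standard normal quantile.
        return tuple(NormalDist().inv_cdf((1.0 + q) / 2.0) ** 2 for q in levels)
    if p == 2:
        # Chi-square with 2 degrees of freedom has CDF 1 - exp(-x / 2).
        return tuple(-2.0 * math.log(1.0 - q) for q in levels)
    raise ValueError("this sketch only covers p = 1 and p = 2")


eps_1, eps_2 = contamination_levels(0.10)
print(eps_1, eps_2, c_values_normal(0.10, 2))
```

Since $\varepsilon_1 < 0.5 < \varepsilon_2$, the constants always satisfy $c_1 < c_2$, and the interval $[c_1, c_2]$ widens as $\varepsilon$ grows.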



Then we have the following lemma. 



Lemma 2.1. Assume that the function $g$ in (2.1) is strictly decreasing and continuous.
Then

1. The function $m(v, \mu)$ is continuous.

2. The unique minimum of $m(v, \mu)$ is attained at $v = 1$ and $\mu = 0$.

3. The function $m(v, \mu)$ is strictly increasing in $\mu$ for fixed $v$ and $\mu \ge 0$. We
also have $\lim_{\mu \to \infty} m(v, \mu) = \infty$.

4. The function $m(v, 0)$ is strictly increasing in $v$ for $v \ge 1$. We also have
$\lim_{v \to \infty} m(v, 0) = \infty$.

5. For any $c_1 \le c \le c_2$ there exists a unique $v(c) \ge 1$ such that $m(v(c), 0) = c$.
The function $v(c)$ is strictly increasing and continuous and $v(c_1) = 1$.

6. For any $c_1 \le c \le c_2$ and $1 \le v \le v(c)$, there exists a unique value $l(c, v) \ge 0$
such that $m(v, l(c, v)) = c$. The function $l(c, v)$ is continuous and strictly
increasing in $c$. Put $v_2 = v(c_2)$, then $l(c_2, v_2) = 0 = l(c_1, 1)$.

7. Define for any $c$, $c_1 \le c \le c_2$, the values $r^*(c)$, $v^*(c)$ and $\mu^*(c)$ by

$$v^*(c) = \arg\max_{1 \le v \le v(c)} \big( l(c, v) + (cv)^{1/2} \big),$$

$$r^*(c) = l(c, v^*(c)) + (c\, v^*(c))^{1/2},$$

$$\mu^*(c) = l(c, v^*(c)).$$

Then the function $r^*(c)$ is strictly increasing and continuous.
Proof. (1) We can write

$$P(h(z, v, \mu) \le \delta) = \int_{C_\delta} g\Big( (v^{1/2} y_1 + \mu)^2 + v^{-1/(p-1)} \sum_{i=2}^{p} y_i^2 \Big)\, dy,$$

where $C_\delta = \{y \in R^p : \|y\|^2 \le \delta\}$. Then part (1) of the Lemma follows from the
Dominated Convergence Theorem.

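(2) It is enough to show that for any $(v, \mu) \ne (1, 0)$ and $\delta > 0$

$$P(h(z, v, \mu) \le \delta) < P(h(z, 1, 0) \le \delta). \qquad (2.2)$$

Let $D_0 = \{z \in R^p : \|z\|^2 \le \delta\}$ and

$$D_1 = \Big\{z \in R^p : (z_1 - \mu)^2 / v + v^{1/(p-1)} \sum_{i=2}^{p} z_i^2 \le \delta\Big\}.$$

Put $E_0 = D_0 - D_0 \cap D_1$ and $E_1 = D_1 - D_0 \cap D_1$. Then, since

$$\int_{D_0} dz = \int_{D_1} dz,$$

we also have

$$\int_{E_0} dz = \int_{E_1} dz. \qquad (2.3)$$

Since $g$ is strictly decreasing, we have

$$P(h(z, 1, 0) \le \delta) = \int_{D_0 \cap D_1} g(\|z\|^2)\, dz + \int_{E_0} g(\|z\|^2)\, dz > \int_{D_0 \cap D_1} g(\|z\|^2)\, dz + g(\delta) \int_{E_0} dz \qquad (2.4)$$

and

$$P(h(z, v, \mu) \le \delta) = \int_{D_0 \cap D_1} g(\|z\|^2)\, dz + \int_{E_1} g(\|z\|^2)\, dz < \int_{D_0 \cap D_1} g(\|z\|^2)\, dz + g(\delta) \int_{E_1} dz. \qquad (2.5)$$

From (2.3), (2.4) and (2.5) we get (2.2), and part (2) of the Lemma follows.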

(3) Take $\delta > 0$ and $\mu_0 < \mu_1$. Put $D_i = \{z \in R^p : h(z, v, \mu_i) \le \delta\}$, $i = 0, 1$, and
$E_i = D_i - D_0 \cap D_1$, $i = 0, 1$. Then to prove that $m(v, \mu)$ is strictly increasing in
$\mu$, it will be enough to show that

$$\int_{E_0} g(\|z\|^2)\, dz > \int_{E_1} g(\|z\|^2)\, dz. \qquad (2.6)$$

Let $\bar{\mu} = (\mu_0 + \mu_1)/2$, then

$$z = (z_1, \dots, z_p) \in E_0 \Rightarrow z_1 < \bar{\mu} \qquad (2.7)$$

and

$$z = (z_1, \dots, z_p) \in E_1 \Rightarrow z_1 > \bar{\mu}. \qquad (2.8)$$

Consider the transformation $k(z_1, \dots, z_p) = (2\bar{\mu} - z_1, z_2, \dots, z_p)$. Then it is easy
to show that $k(E_0) = E_1$. Moreover by (2.7), (2.8) and the fact that $g$ is strictly
decreasing we obtain

$$g(\|k(z)\|^2) < g(\|z\|^2) \quad \forall z \in E_0. \qquad (2.9)$$

Since the Jacobian of $k$ has absolute value one, using the change of variable
formula in multiple integrals and (2.9), we get

$$\int_{E_1} g(\|z\|^2)\, dz = \int_{E_0} g(\|k(z)\|^2)\, dz < \int_{E_0} g(\|z\|^2)\, dz,$$

and therefore (2.6) holds. The proof of $\lim_{\mu \to \infty} m(v, \mu) = \infty$ is immediate from
the fact that $h(z, v, \mu) \to \infty$ in probability when $\mu \to \infty$.

(4) The fact that $m(v, 0)$ is strictly increasing in $v$ for $v \ge 1$ follows from the
Lemma in Section 3 of Yohai and Maronna (1990). The proof of $\lim_{v \to \infty} m(v, 0) = \infty$
is immediate from the fact that $h(z, v, 0) \to \infty$ in probability when $v \to \infty$.

(5) From the definition of $v(c)$, it follows that $v(c_1) = 1$. Then, since $m(1, 0) =
c_1 \le c$ and $\lim_{v \to \infty} m(v, 0) = \infty$, the continuity of $m$ implies that there exists
$v(c)$ satisfying $m(v(c), 0) = c$. Part (4) of the Lemma implies that $v(c)$ is strictly
increasing. The continuity of $v(c)$ follows easily from the continuity of $m$.

(6) The existence and uniqueness of $l(c, v)$ follow from the continuity of
$m$ and the facts that $m(v, 0) \le c$ and $\lim_{\mu \to \infty} m(v, \mu) = \infty$. Part (3) of the Lemma
implies that $l(c, v)$ is strictly increasing in $c$. To show the continuity of $l$, consider
a sequence $(c_n, v_n) \to (c, v)$. Suppose that $l(c_n, v_n) \to l(c, v)$ does not hold, then
without loss of generality we can assume that

$$l(c_n, v_n) \to l_0 \ne l(c, v). \qquad (2.10)$$

However by the continuity of $m$ we have

$$\lim_{n \to \infty} m(v_n, l(c_n, v_n)) = c = m(v, l_0).$$

Then $l_0 = l(c, v)$, contradicting (2.10).

(7) Take $c < c^*$. Then, since $v^*(c) \le v(c) < v(c^*)$, by part (6) of the Lemma
we have

$$r^*(c^*) \ge l(c^*, v^*(c)) + (c^* v^*(c))^{1/2} > l(c, v^*(c)) + (c\, v^*(c))^{1/2} = r^*(c),$$

and therefore $r^*(c)$ is strictly increasing.

We now show that $r^*$ is continuous at any point $c$. Suppose this is not true,
then there exists a sequence $c_n$ such that $c_n \to c$ and $v^*(c_n) \to v_0 \ne v^*(c)$. Then

$$l(c, v_0) + (c\, v_0)^{1/2} < l(c, v^*(c)) + (c\, v^*(c))^{1/2}. \qquad (2.11)$$

On the other hand, by the continuity of $l(c, v)$, we have

$$l(c, v_0) + (c\, v_0)^{1/2} = \lim_{n \to \infty} \big( l(c_n, v^*(c_n)) + (c_n v^*(c_n))^{1/2} \big) \qquad (2.12)$$

and

$$l(c, v^*(c)) + (c\, v^*(c))^{1/2} = \lim_{n \to \infty} \big( l(c_n, v^*(c)) + (c_n v^*(c))^{1/2} \big). \qquad (2.13)$$

From (2.11), (2.12) and (2.13) we obtain that there exists $n_0$ such that

$$l(c_{n_0}, v^*(c_{n_0})) + (c_{n_0} v^*(c_{n_0}))^{1/2} < l(c_{n_0}, v^*(c)) + (c_{n_0} v^*(c))^{1/2}$$

and this contradicts the definition of $v^*(c_{n_0})$.

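The next lemma gives the median of any statistic under a point mass
contaminated distribution.

Lemma 2.2. Given a distribution $G$ on $R^p$ and a Borel measurable function $t :
R^p \to R$, we have

$$Q(t(z), F_{0,I}, \varepsilon_1) \le Q(t(z), (1 - \varepsilon) F_{0,I} + \varepsilon G, 0.5) \le Q(t(z), F_{0,I}, \varepsilon_2).$$

Moreover

$$Q(t(z), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{z_0}, 0.5) = \begin{cases} Q(t(z), F_{0,I}, \varepsilon_1) & \text{if } t(z_0) < Q(t(z), F_{0,I}, \varepsilon_1), \\ t(z_0) & \text{if } Q(t(z), F_{0,I}, \varepsilon_1) \le t(z_0) \le Q(t(z), F_{0,I}, \varepsilon_2), \\ Q(t(z), F_{0,I}, \varepsilon_2) & \text{if } t(z_0) > Q(t(z), F_{0,I}, \varepsilon_2). \end{cases}$$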
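The three-case formula of Lemma 2.2 can be verified numerically in a simple setting. The sketch below takes the univariate standard normal as central model and $t(z) = z^2$ (both choices are assumptions made for the illustration), and compares the formula with a direct bisection search for the median of the contaminated distribution; all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cdf_sq(x):
    """P(z^2 <= x) for z ~ N(0, 1)."""
    return 0.0 if x < 0 else 2.0 * STD_NORMAL.cdf(math.sqrt(x)) - 1.0


def quantile_sq(q):
    """q-quantile of z^2 for z ~ N(0, 1), with 0 < q < 1."""
    return STD_NORMAL.inv_cdf((1.0 + q) / 2.0) ** 2


def median_by_lemma(eps, z0):
    """Median of z^2 under (1 - eps) N(0, 1) + eps * delta_{z0}, via the three cases."""
    low = quantile_sq((0.5 - eps) / (1.0 - eps))
    high = quantile_sq(0.5 / (1.0 - eps))
    # The three cases amount to clamping t(z0) = z0^2 to [low, high].
    return min(max(z0 ** 2, low), high)


def median_by_bisection(eps, z0, upper=50.0, iters=200):
    """Smallest x whose mixture CDF is at least 0.5, located by bisection."""
    def mixture_cdf(x):
        return (1.0 - eps) * cdf_sq(x) + (eps if z0 ** 2 <= x else 0.0)

    a, b = 0.0, upper
    for _ in range(iters):
        mid = (a + b) / 2.0
        if mixture_cdf(mid) >= 0.5:
            b = mid
        else:
            a = mid
    return b


for z0 in (0.1, 0.7, 3.0):  # one point mass location per case of the lemma
    print(z0, median_by_lemma(0.10, z0), median_by_bisection(0.10, z0))
```

The three locations of the point mass exercise the three branches: a point mass close to the center, one in the intermediate range where the median equals $t(z_0)$, and one far in the tail.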

In the following theorem we see that the location MVE estimate turns out 
to be the origin for point mass sufficiently large or small. In any other case, the 
location estimate under contamination coincides with the center of the ellipsoid 
whose contour contains the point mass. 

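Theorem 2.3. Assume that the function $g$ in (2.1) is strictly decreasing and
continuous, and let $(T, V^*)$ be the MVE estimates of location and covariance matrix
shape. Then,

(i) Let $k_i = r^*(c_i)$, $i = 1, 2$. Then, for any $k_1 \le k < k_2$, writing $c = r^{*-1}(k)$, we have

$$T((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}) = (\mu^*(c), 0, \dots, 0),$$
$$V^*((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}) = \mathrm{diag}(v^*(c), v^*(c)^{-1/(p-1)}, \dots, v^*(c)^{-1/(p-1)}). \qquad (2.14)$$

(ii) If $k > k_2$ or $k < k_1$,

$$T((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}) = (0, 0, \dots, 0),$$
$$V^*((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}) = I. \qquad (2.15)$$

(iii) For $k = k_2$ there are two solutions for $(T((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}), V^*((1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}))$:
the ones given in (2.14) and (2.15).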

Proof. We start by proving (i). Put $c = r^{*-1}(k)$. Then

$$Q(h(z, v^*(c), \mu^*(c)), F_{0,I}, \varepsilon_1) = m(v^*(c), \mu^*(c)) = c$$

and

$$h((k, 0, \dots, 0), v^*(c), \mu^*(c)) = c.$$

Then, according to Lemma 2.2 we have

$$Q(h(z, v^*(c), \mu^*(c)), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) = c. \qquad (2.16)$$

Now we prove that if $(\mu, v) \ne (\mu^*(c), v^*(c))$, then

$$Q(h(z, v, \mu), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) > c. \qquad (2.17)$$

Suppose first that $\mu > l(c, v)$. Then

$$Q(h(z, v, \mu), F_{0,I}, \varepsilon_1) = m(v, \mu) > m(v, l(c, v)) = c,$$

and by Lemma 2.2, (2.17) holds. Suppose now that $\mu \le l(c, v)$. In this case

$$k = l(c, v^*(c)) + (c\, v^*(c))^{1/2} > l(c, v) + (cv)^{1/2} \ge \mu + (cv)^{1/2}$$

and therefore

$$h((k, 0, \dots, 0), v, \mu) = \frac{(k - \mu)^2}{v} > c. \qquad (2.18)$$

We also have

$$Q(h(z, v, \mu), F_{0,I}, \varepsilon_2) \ge Q(h(z, 1, 0), F_{0,I}, \varepsilon_2) = c_2 > c. \qquad (2.19)$$

(2.18) and (2.19) imply (2.17). Then (i) is proved.

Take $k > k_2$. Observe first that $Q(h(z, 1, 0), F_{0,I}, \varepsilon_2) = c_2$ and therefore,
because of Lemma 2.2, for any $k$

$$Q(h(z, 1, 0), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) \le c_2. \qquad (2.20)$$

Given any $(\mu, v) \ne (0, 1)$, put $c = m(v, \mu)$. There are three possible cases: (a)
$v > v_2$, (b) $v \le v_2$, $\mu > l(c_2, v)$ and (c) $v \le v_2$, $\mu \le l(c_2, v)$. In cases (a) and (b)
we have $m(v, \mu) > c_2$ and therefore by Lemma 2.2

$$Q(h(z, v, \mu), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) > c_2. \qquad (2.21)$$

In case (c),

$$k > l(c_2, v^*(c_2)) + (c_2 v^*(c_2))^{1/2} \ge l(c_2, v) + (c_2 v)^{1/2} \ge \mu + (c_2 v)^{1/2}$$

and therefore

$$h((k, 0, \dots, 0), v, \mu) = \frac{(k - \mu)^2}{v} > c_2. \qquad (2.22)$$

Since $Q(h(z, v, \mu), F_{0,I}, \varepsilon_2) > c_2$, using Lemma 2.1 and (2.22) we get once
again (2.21). This shows (ii) when $k > k_2$.

Finally consider the case $k < k_1$. Since $Q(h(z, 1, 0), F_{0,I}, \varepsilon_1) = c_1$, $k_1 = c_1^{1/2}$
and

$$h((k, 0, \dots, 0), 1, 0) = k^2 < k_1^2 = c_1, \qquad (2.23)$$

by Lemma 2.2 we have

$$Q(h(z, 1, 0), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) \le c_1. \qquad (2.24)$$

Let now $(\mu, v) \ne (0, 1)$, then we have $Q(h(z, v, \mu), F_{0,I}, \varepsilon_1) > c_1$, and by Lemma
2.2

$$Q(h(z, v, \mu), (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(k, 0, \dots, 0)}, 0.5) > c_1. \qquad (2.25)$$

Then, by (2.24) and (2.25) we conclude that (ii) holds for $k < k_1$. (iii) is proved
by combining the proofs of (i) and (ii).
Corollary 2.4. Assume that the function $g$ in (2.1) is strictly decreasing and
continuous and let $(T, V^*)$ be the MVE estimates of location and covariance matrix
shape. Then

$$B(T, F_{0,I}, \varepsilon) = \sup_{c_1 \le c \le c_2} \mu^*(c)$$

and

$$B(V^*, F_{0,I}, \varepsilon) = \sup_{c_1 \le c \le c_2} v^*(c)^{p/(p-1)}.$$

Remark 2.5. In the normal case there is numerical evidence that $\mu^*$ and $v^*$ are
strictly monotone. Therefore

$$B(T, F_{0,I}, \varepsilon) = \mu^*(c_2) \qquad (2.26)$$

and

$$B(V^*, F_{0,I}, \varepsilon) = v^*(c_2)^{p/(p-1)}.$$

Remark 2.6. The derivation of Lemma 2.1 and Theorem 2.3 relies on
having a unique maximum of $l(c, v) + (cv)^{1/2}$ for each $c \in [c_1, c_2]$. In this way,
$v^*(c)$ is well defined. We have not dealt with the case of multiple values achieving
the maximum, to avoid a relentlessly technical presentation of the problem.

Remark 2.7. Croux and Haesbroeck (2002) studied the maxbias curve for the MVE
estimate of location in the univariate setting. If $F = F_{0,1}$ and $H_F^+(x)$ and $H_F^-(x)$
stand for the solutions $s$ of the equations

$$F(x + s) - F(x - s) = \frac{0.5}{1 - \varepsilon}$$

and

$$F(x + s) - F(x - s) = \frac{0.5 - \varepsilon}{1 - \varepsilon}$$

respectively, then Proposition 3 of Croux and Haesbroeck (2002) states that the
maximum bias of the univariate MVE estimate of location is the solution $b$ of the
equation

$$H_F^-(b) = H_F^+(0). \qquad (2.27)$$

The authors emphasize that an interesting feature of the maxbias of the MVE
location estimator is that the most unfavorable contaminating distribution is a
point mass distribution $\delta_z$ with $z$ not at infinity but much closer to the center of
the distribution. Actually, Corollary 2.4 and (2.26) amount to (2.27). To see that,
let us observe that the formulation for $R^1$ is much simpler because $v = 1$.
Then the dependence on $v$ of the functions $m(v, \mu)$, $l(c, v)$ and $h(z, v, \mu)$ can be
omitted. Consequently, $\mu^*(c_2) = l(c_2)$ and

$$[H_F^+(0)]^2 = Q(z^2, F, \varepsilon_2) = c_2 = m(l(c_2)) = Q(h(z, l(c_2)), F, \varepsilon_1) = [H_F^-(l(c_2))]^2.$$
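Equation (2.27) can be solved numerically for the standard normal model. The sketch below is a hedged illustration: it assumes that $H_F^+(x)$ and $H_F^-(x)$ solve $F(x+s) - F(x-s) = 0.5/(1-\varepsilon)$ and $F(x+s) - F(x-s) = (0.5-\varepsilon)/(1-\varepsilon)$ respectively, which is the reading suggested by the chain of identities above, and it relies on $H_F^-(b)$ being increasing in $b \ge 0$ for a symmetric unimodal $F$. The function names and the bisection bounds are hypothetical choices.

```python
from statistics import NormalDist

F = NormalDist()  # central model, assumed standard normal


def solve_increasing(func, target, lo, hi, iters=100):
    """Bisection for func(x) = target when func is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def half_width(x, mass):
    """Solution s of F(x + s) - F(x - s) = mass."""
    return solve_increasing(lambda s: F.cdf(x + s) - F.cdf(x - s), mass, 0.0, 50.0)


def univariate_maxbias(eps):
    """Solution b >= 0 of H^-(b) = H^+(0), under the assumptions stated above."""
    target = half_width(0.0, 0.5 / (1.0 - eps))  # H^+(0)

    def h_minus(b):
        return half_width(b, (0.5 - eps) / (1.0 - eps))

    return solve_increasing(h_minus, target, 0.0, 50.0)


for eps in (0.05, 0.10, 0.20):
    print(eps, round(univariate_maxbias(eps), 3))
```

The resulting value $b$ is finite for every $\varepsilon < 0.5$, in line with the observation that the least favorable point mass is not located at infinity.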

In Table 1 and Figure 1 we show the maxbias curves of the MVE estimate of
multivariate location and in Table 2 and Figure 2 those of the covariance shape.
In Table 3 we show the values of $z_0$ such that $F = (1 - \varepsilon) F_{0,I} + \varepsilon \delta_{(z_0, 0, \dots, 0)}$ is the
least favorable point mass contaminated distribution. For comparison with other
estimates, see Table 2 of Adrover and Yohai (2002). Adrover and Yohai (2002)
observed that the maximum biases of the MVE and the Stahel-Donoho estimate
(1982) increase with the dimension in the location case. The bias performance
of the MVE compares favorably with that of the Minimum Covariance Determinant
(Rousseeuw, 1985), although its performance is rather poor compared to the
projection estimate (Tyler, 1994).



TABLE 1. Maximum biases of the MVE location estimate.

   p    ε = 0.01     0.05     0.10     0.15     0.20
   2      0.17       0.40     0.63     0.91     1.29
   3      0.17       0.41     0.69     1.03     1.48
   4      0.17       0.42     0.72     1.08     1.57
   5      0.16       0.42     0.73     1.12     1.65
   6      0.16       0.42     0.74     1.14     1.71
   7      0.16       0.42     0.74     1.16     1.78
   8      0.15       0.41     0.75     1.18     1.84
   9      0.15       0.42     0.75     1.20     1.90
  10      0.14       0.41     0.75     1.22     1.97
  15      0.14       0.40     0.76     1.31     2.30
  20      0.13       0.39     0.77     1.41     2.65

TABLE 2. Maximum biases of the MVE covariance shape estimate.

   p    ε = 0.01     0.05     0.10     0.15     0.20
   2      1.36       4.45     9.49    18.66    37.33
   3      1.43       3.58     6.86    12.62    24.09
   4      1.47       3.45     6.65    12.56    24.94
   5      1.51       3.48     6.90    13.54    28.12
   6      1.55       3.57     7.31    14.90    32.37
   7      1.57       3.67     7.84    16.47    37.35
   8      1.60       3.81     8.30    18.21    43.00
   9      1.63       3.92     8.84    20.07    49.32
  10      1.66       4.86     9.41    22.06    56.31
  15      1.76       4.71    12.45    33.83   102.63
  20      1.85       5.37    15.04    48.90   172.45



TABLE 3. Least favorable $z_0$.

   p    ε = 0.01     0.05     0.10     0.15     0.20
   2      1.55       2.18     2.86     3.68     4.75
   3      2.01       2.91     3.80     4.98     6.37
   4      2.40       3.41     4.65     6.22     8.45
   5      2.73       3.93     5.46     7.48    10.43
   6      3.04       4.43     6.26     8.75    11.95
   7      3.33       4.90     6.75    10.05    14.74
   8      3.59       5.36     7.84    11.39    17.06
   9      3.85       5.81     8.62    12.75    19.52
  10      4.09       6.24     9.40    14.15    22.11
  15      5.17       8.30    13.36    21.71    37.14
  20      6.12      10.27    17.47    30.31    56.04
A Study of Belgian Inflation, Relative Prices
and Nominal Rigidities using New Robust 
Measures of Skewness and Tail Weight 

L. Aucremanne, G. Brys, M. Hubert, P.J. Rousseeuw and A. Struyf 

Abstract. This paper studies the distribution of Belgian consumer price chan- 
ges and its interaction with aggregate inflation over the period June 1976- 
September 2000. Given the fat-tailed nature of this distribution, both classical 
and robust measures of location, scale, skewness and tail weight are presented. 

The chronic right skewness of the distribution, revealed by the robust mea- 
sures, is cointegrated with aggregate inflation, suggesting that it is largely 
dependent on the inflationary process itself and would disappear at zero in- 
flation. 

Mathematics Subject Classification (2000). Primary 91B84; Secondary 62G32. 
Keywords. Belgian inflation data, robustness, skewness, tail weight. 



1. Introduction 

The aim of this paper is to study the cross-sectional properties of Belgian inflation 
data. The international literature on this issue highlights, broadly speaking, three 
not mutually exclusive aspects. First, many papers discuss the positive relationship 
between inflation on the one hand and the dispersion and/or the asymmetry of 
relative prices on the other. Often these relationships have been interpreted as 
being symptomatic of nominal rigidities of one form or another. Ball and Mankiw’s 
menu cost model (Ball and Mankiw, 1995) has recently been the focal point of this 
strand in the literature. Second, in line with the findings of Bryan et al. (1997), 
it is often documented that inflation data are fat-tailed and this motivates the 
stochastic approach to core inflation in which robust estimators of location are 
proposed. Third, in quite a few countries researchers found not only a considerable 
degree of frequently switching left and right skewness, but, on average, also a 
tendency towards right skewness. 
This type of analysis often uses the classical characteristics of location, scale,
skewness and kurtosis. These measures are, however, very sensitive to outlying 
values (Ruppert, 1987). In this paper we will compare them with robust alterna- 
tives, not only for location, as is typically done in the core inflation part of this 
literature, but also for scale, skewness and tail weight. We will address the ques- 
tion whether robust measures are applicable in this context, because inflation is 
often strongly influenced by outliers which are both correct and important. Overly 
strong robustness may downweight outliers too much and thus yield little sensi- 
tivity. Next, we study the relation between skewness and inflation, and interpret 
the results in the framework of menu cost models (Ball and Mankiw, 1994). 



2. Description of the Data 

This study is based on Belgian monthly consumer price index (CPI) data for 
the period June 1976 up to September 2000, yielding a total of 292 months. It 
was not possible to start at an earlier point in time, for instance during the low 
inflation regime of the sixties, for data availability reasons. For each month, we 
have aggregated price indices of 60 different product categories such as meat, 
clothing, tobacco, electricity, recreational and cultural services, etc. The index of 
product category $i$ $(i = 1, \dots, 60)$ at month $t$ $(t = 0, \dots, 291)$ is denoted by $I_{i,t}$.

We started by transforming these data into percentage 1-monthly price
changes $\Pi_{i,t}$ $(t = 1, \dots, 291)$, defined by $\Pi_{i,t} = (I_{i,t}/I_{i,t-1}) - 1$. For a motivation and
more information about selecting $\Pi_{i,t}$ instead of any other transformation of the
raw data, we refer to Aucremanne et al. (2002).

Summarizing, we can represent our final dataset with a matrix that consists
of 291 rows (indicating the different periods) and 60 columns (for the product
categories). A cross-section contains the price changes of one particular month,
and thus corresponds with one row of the data matrix. Additionally, we have
taken account of the fact that each product category has a time-varying weight
$w_{i,t}$ which is obtained by multiplying for each period $t$ its fixed Laspeyres-type
weight $w_i$ by the change in its relative price between the base period and period $t$.
The Laspeyres-type weights $w_i$ of the CPI reflect the importance of each category
in total household consumption expenditure in the base period. The weights satisfy
$\sum_{i=1}^{60} w_{i,t} = 1$ for all $t$. In so doing, the weighted mean of the 60 product-specific
price changes $\Pi_{i,t}$ corresponds to aggregate inflation.



3. Detection of Outliers 

In this section we will mainly show that the Belgian price changes contain a sub- 
stantial number of outlying values. For this, we will consider the cross-sections 
and compute robust measures of tail weight for each of them. To show the im- 
portance of the tails of univariate data (as the cross-sections are), we can use the 
classical measure of kurtosis. In general, the kurtosis is said to characterize the 
fatness of the tails, or equivalently the tail weight, but it also reflects the shape of
the density in the center. Moreover, the classical measure of kurtosis is based on
moments of the dataset, and thus it is strongly influenced by outliers. Despite the
fact that outliers determine the tail weight, we can construct robust measures to
compute it. For this purpose, we first need a robust estimator of location and scale
for univariate data. In general, we denote the sample by $x_i$ $(i = 1, \dots, n)$ and the
corresponding weights by $w_i$. As usual, the weights sum to 1. Time subscripts are
omitted to simplify notation.

3.1. Robust Measure of Location 

A measure of location should estimate a value that characterizes the central
position of the data. The best known measure of location is the weighted mean or
average, defined as $\bar{x} = \sum_{i=1}^{n} w_i x_i$. Note that with the dataset we analyze, the
weighted mean corresponds to observed inflation.

A typical robust measure of location is the median. Here, we need the weighted
variant of the median. In general, all measures based on percentiles can be
modified easily to their weighted version by filling in the weighted percentiles. That
is the main reason why all robust measures presented in this paper are based on
percentiles. Initial work on other robust measures based on couples and triples of
observations was abandoned precisely because it was not straightforward to
construct them in weighted terms. Indeed, Brys et al. (2003) and Brys et al. (2004)
discuss some unweighted robust skewness measures based on couples and triples.
For the p% weighted percentile, denoted by $Q_p$, we first sort the observations from
smallest to largest, and then take the first value with cumulative weight higher
than p%. The weighted median equals the 50% weighted percentile, or $Q_{0.50}$.
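The weighted percentile rule just described can be sketched as follows. The function name `weighted_percentile`, the convention of passing $p$ as a fraction in $[0, 1)$ and the small data set are assumptions made for the example.

```python
def weighted_percentile(values, weights, p):
    """First sorted value whose cumulative weight is higher than p (0 <= p < 1).

    The weights are assumed to be nonnegative and to sum to 1.
    """
    pairs = sorted(zip(values, weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative > p:
            return value
    return pairs[-1][0]  # guards against rounding when the weights sum to ~1


# Hypothetical cross-section: four price changes with unequal weights.
changes = [3.0, 1.0, 2.0, 10.0]
weights = [0.2, 0.1, 0.3, 0.4]
print(weighted_percentile(changes, weights, 0.50))  # weighted median
```

Here the observation 10.0 carries 40% of the weight, so it determines the upper percentiles even though it is a single data point.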



3.2. Robust Measure of Scale 

Scale characteristics measure how “spread out” the data values are. The classical 
measure of scale is the standard deviation, which is given by:

$$s = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{1 - \sum_{i=1}^{n} w_i^2}} \qquad (3.1)$$

where $\bar{x}$ stands for the mean. This equation is based on the unbiased sample
moments of unequally-weighted frequency distributions presented in Roger (2000).
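A minimal sketch of the weighted mean and of the weighted standard deviation follows, assuming (3.1) has the form $s = \big(\sum_i w_i (x_i - \bar{x})^2 / (1 - \sum_i w_i^2)\big)^{1/2}$ with weights summing to 1. The function names are hypothetical. With equal weights $w_i = 1/n$ the denominator becomes $(n-1)/n$, so the formula reduces to the usual unbiased sample standard deviation, which the example uses as a check.

```python
import math
import statistics


def weighted_mean(values, weights):
    """Weighted mean, for weights that sum to 1."""
    return sum(w * x for x, w in zip(values, weights))


def weighted_std(values, weights):
    """sqrt( sum w_i (x_i - mean)^2 / (1 - sum w_i^2) ), weights summing to 1."""
    mean = weighted_mean(values, weights)
    numerator = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return math.sqrt(numerator / (1.0 - sum(w * w for w in weights)))


# Equal weights reproduce the classical sample standard deviation.
values = [1.0, 2.0, 3.0, 4.0]
equal = [0.25] * 4
print(weighted_std(values, equal), statistics.stdev(values))
```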

The interquartile range or IQR is a robust measure of scale which, like the
median, is based on percentiles and is easy to calculate. The (weighted) IQR is
the distance between the 75% percentile $Q_{0.75}$ and the 25% percentile $Q_{0.25}$, or

$$\mathrm{IQR} = Q_{0.75} - Q_{0.25}. \qquad (3.2)$$



3.3. Robust Measures of Tail Weight 

We propose several tailmass measures as alternative to the classical kurtosis. They 
simply count the outliers in a data set. To distinguish the outlying values from the 
regular ones we use the following outlier rules: 
1. The general boxplot rejection rule defines outliers as points outside the
interval:

$$[Q_{0.50} - 1.5\,\mathrm{IQR},\ Q_{0.50} + 1.5\,\mathrm{IQR}] \qquad (3.3)$$

This criterion is based on the definition of the whiskers in the univariate
boxplot (Tukey, 1977). Note that this interval corresponds approximately to
$[\bar{x} - 2s, \bar{x} + 2s]$ in case of the normal distribution.

2. The asymmetric boxplot rule. This rule is a special case of the bivariate
bagplot (Rousseeuw et al., 1999), which is a bivariate generalization of the
boxplot. Points are considered outlying when they lie outside the interval:

$$[Q_{0.50} - 3(Q_{0.50} - Q_{0.25}),\ Q_{0.50} + 3(Q_{0.75} - Q_{0.50})] \qquad (3.4)$$

For each outlier rule, we define left outliers as those data points which are 
smaller than the lower bound of the interval defined in (3.3) or (3.4). Analogously 
the right outliers are larger than the upper bound of this interval. We then obtain 
the left-tailmass measures as the sum of the weights of the left outliers, i.e., 

    left-tailmass = \sum_{i=1}^{n} w_i g_i        (3.5)

where g_i equals 1 if x_i is a left outlier and 0 if not. This leads to left-tailmass(box)
and left-tailmass(abp). Considering the right outliers, we obtain right-tailmass(box)
and right-tailmass(abp).
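
The tailmass measures can be illustrated with the following sketch. It rests on two assumptions that should be checked against the original formulas: the general boxplot rule is taken to be the median plus or minus 1.5 IQR, and the weighted percentile is the "first value with cumulative weight above p" rule of Section 3.1. All function names and the sample data are hypothetical.

```python
def weighted_percentile(values, weights, p):
    """First sorted value whose cumulative weight share exceeds p (0 < p < 1)."""
    total = sum(weights)
    cumulative = 0.0
    pairs = sorted(zip(values, weights))
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total > p:
            return value
    return pairs[-1][0]


def outlier_bounds(values, weights, rule):
    """Lower and upper outlier bounds for the 'box' or 'abp' rule."""
    q25, q50, q75 = (weighted_percentile(values, weights, p) for p in (0.25, 0.50, 0.75))
    if rule == "box":                      # assumed form of the general boxplot rule
        half_width = 1.5 * (q75 - q25)
        return q50 - half_width, q50 + half_width
    if rule == "abp":                      # asymmetric boxplot rule (3.4)
        return q50 - 3 * (q50 - q25), q50 + 3 * (q75 - q50)
    raise ValueError(f"unknown rule: {rule}")


def tailmass(values, weights, rule):
    """Return (left-tailmass, right-tailmass): summed weights of left and right outliers."""
    lower, upper = outlier_bounds(values, weights, rule)
    left = sum(w for x, w in zip(values, weights) if x < lower)
    right = sum(w for x, w in zip(values, weights) if x > upper)
    return left, right


# Hypothetical cross-section of price changes (in %) and expenditure weights.
changes = [0.1, -0.3, 0.4, 2.5, 0.2]
weights = [0.30, 0.10, 0.25, 0.05, 0.30]
print(tailmass(changes, weights, "box"))
print(tailmass(changes, weights, "abp"))
```

In this small example both rules flag the same two observations, so the left and right tailmass equal the weights of the smallest and largest price change respectively.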

3.4. Results for Belgian Inflation Data 

All measures defined above were applied to the cross-sections of the Belgian in- 
flation data. In Figure 1 the resulting time series are plotted, by considering all 
291 consecutive cross-sections. On the plots, the scatter points of the different 
measures are depicted, together with a smoother (solid line). We have chosen to 
use a lowess smoother, which is a robust scatter plot smoother (Cleveland, 1979). 
Lowess uses robust locally linear fits, by placing a window about each point and 
weighting the points that are inside the window so that nearby points get the most 
weight. 

Figure 1. Time series of the tail weight measures.

All the tailmass measures show substantial proportions of outliers and these
proportions are, moreover, very volatile. This is why we will use robust measures 
of location, scale and skewness to analyze the data further on. Before doing so, 
we have a closer look at the tails. On the basis of the smoothed curves,
left-tailmass(box) amounted to roughly 5% at the beginning of the sample and
increased to approximately 10% at the end of the sample. Right-tailmass(box), in
contrast, tends to oscillate around 15% during the whole sample. These observed
tail weights are substantially in excess of those for the normal distribution, for 
which the corresponding tailmass(box) measures amount to 2.15% for each tail. 
This clearly illustrates the fat-tailed nature of Belgian inflation data, a finding 
which is in line with the results in the international literature on this issue. It 
should, however, be stressed that most of the international evidence is based on 
the classical kurtosis, which may for several reasons underestimate the importance 
of outliers (see for instance Aucremanne, 2000, on this so-called masking issue). 
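
The normal reference value quoted above can be checked numerically. The sketch below assumes that the general boxplot rule flags points beyond the median plus or minus 1.5 IQR; under that assumption the mass in each tail of a normal distribution comes out at about 2.15%.

```python
from statistics import NormalDist

z = NormalDist()                              # standard normal: median 0, s = 1
iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)       # about 1.349 standard deviations
upper_bound = 0.0 + 1.5 * iqr                 # median + 1.5 * IQR (assumed box rule)
right_tail = 1.0 - z.cdf(upper_bound)         # by symmetry the left tail is identical

print(f"upper bound: {upper_bound:.3f} s, tail mass: {100 * right_tail:.2f}%")
```

The bound lies at roughly 2.02 standard deviations, which is why the interval is close to [\bar{x} - 2s, \bar{x} + 2s] for normal data.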

The fact that right-tailmass(box) tends to have larger values than the
corresponding left alternative is a first indication of the existence of chronic right
skewness, albeit apparently decreasing over time. A possible disadvantage of the 
general boxplot rejection rule is that it takes the left part of the data into account 
when defining a right outlier and vice versa. Using the asymmetric boxplot rule 
overcomes this problem. In so doing, we find that left-tailmass(abp) increases and 
right-tailmass(abp) decreases relative to their symmetrical alternatives, which can 
be seen as another indication of chronic right skewness. These asymmetrically con- 
structed tailmass measures confirm the tendency for the left-hand tail to increase 
and the more stationary behavior of the right-hand tail. 



4. Classical and Robust Measures of Location and Scale 

The classical and robust measures of location and scale are plotted in Figure 2. The 
robust alternatives were already used in the previous section to construct robust 
measures of tail weight. In this section we will compare them to their classical 
counterparts. 

Figure 2. Time series of the location and scale measures.

4.1. Location

As the use of robust measures of location for inflation data is well-developed in the 
core inflation literature (see for instance Aucremanne, 2000, for references), we only 
give some brief comments when comparing the mean to the median. Table 1 reports
unit root tests for the classical and robust measures of location, scale and
skewness. Given the monthly frequency of the data, 11 lags were necessary for each
estimator to produce residuals without autocorrelation. The first column gives the
value of the augmented Dickey-Fuller (ADF) test statistic, while the second column
reports the significance of the constant in the test equation, but only if the null
hypothesis of a unit root is rejected. The 95% critical value for rejection of the
null hypothesis of a unit root is -2.87. Inclusion of a trend in the test equations for those variables
for which the null of a unit root was not rejected did not change the results. In 
other words, they were not trend stationary either. 

Table 1. Unit root tests for the classical and robust measures of
location, scale and skewness.

                      ADF statistic   signif                   ADF statistic   signif
mean                      -1.74                 skew(class)        -3.82        0.12
median                    -1.58                 dskew(class)       -5.80        0.02
                                                dskew(125)         -2.55
standard deviation        -3.97        0.00     dskew(250)         -3.44        0.00
IQR                       -1.74                 meme               -2.59


As the ADF-tests reported in Table 1 fail to reject the null hypothesis of a 
unit root for the classical and the robust measures of location, both time series 
contain a stochastic trend. The smoothed curves in Figure 2 seem to confirm 
this conclusion. The median is substantially less volatile around its trend than the 
mean, thus confirming that using a robust estimator of location yields a substantial 
gain in efficiency in the case of a fat-tailed distribution. This provides a statistical
motivation for using the median as a measure of core inflation. The first robust core
inflation measure was indeed the median (Bryan and Pike, 1991). Subsequently, a 
wide range of alternative robust estimators of core inflation have been proposed. 

A cointegration equation of two non-stationary weighted time series X and
Y is a linear combination of both time series which is itself stationary (Verbeek,
2000). As mean and median both are non-stationary, we can report the normalized 
cointegration equation (CE) in Table 2. The 95% critical value is 19.96 for rejection 
of the null hypothesis of no cointegration equation and 9.24 for rejection of the null 
hypothesis of at most 1 cointegration equation. The cointegration rank and the 
specification of the deterministic components of the model were determined jointly 
on the basis of the general test procedure discussed in Johansen (1992). However, 
models having trends in the levels of the variables (constants in the VAR) were 
not considered. As can be seen from Table 2, the median is cointegrated with the
mean, but with a coefficient (0.5934) that is substantially less than one. In Figure 2 
it can indeed be verified that the median tends to be lower than the mean. This 
is another indication of chronic right skewness in the Belgian inflation data. Note 
that the constant in the cointegration relation is very close to zero, suggesting that 
the chronic asymmetry would disappear completely at zero inflation. 

Table 2. Cointegration of the robust measure of location (median),
of the robust measures of skewness (dskew(125) and
dskew(250)) and of the classical skewness measures (skew(class)
and dskew(class)) with actual inflation (mean). Between brackets
the standard errors are given. Lag number is 11.

                   median    dskew(125)  dskew(250)  skew(class)  dskew(class)
CE:
  Constant          0.0002     0.0412     -0.0533     -0.4711     -11.0449
                   (0.0200)   (0.0566)    (0.0209)    (0.3879)     (5.5618)
  Mean             -0.5934    -1.5037     -0.2579     -0.2853       6.3588
                   (0.0563)   (0.1597)    (0.0589)    (1.0913)    (15.7954)
Trace statistic:
  No CE             35.09      30.86       24.88       28.82        40.62
  At most 1 CE       4.70       3.41        3.39        4.37         3.79

4.2. Scale 

Comparing the classical standard deviation with IQR indicates that relying on a 
robust alternative dramatically alters the characteristics of the measure of scale. 
According to the ADF-tests in Table 1, IQR contains a unit root whereas the 
classical measure appears as a stationary variable. Moreover, Figure 2 shows that 
the variability of IQR around its smoothed curve is substantially lower than the 
variability of the classical standard deviation. We owe this to the fact that the 
classical measure is dominated by the impact of outliers or, in economic terms, by 
large relative price shocks. In the absence of outliers, IQR would in contrast tend 
to be larger than the standard deviation (in the case of the normal distribution 
IQR ≈ 1.35s). Evidently, the impact of outliers on the classical measure of scale
is magnified by squaring their deviations from the mean (see formula (3.1)). 
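
Both statements in this paragraph can be illustrated with a short, unweighted sketch. The factor 1.35 follows from the quartiles of the normal distribution, and a single large shock moves the standard deviation far more than the IQR. The data are hypothetical, and Python's `statistics.quantiles` is used instead of the weighted percentile rule of Section 3.1, so the numbers only illustrate the mechanism.

```python
from statistics import NormalDist, quantiles, stdev

# For a normal distribution, IQR / s is the distance between the 75% and 25% quantiles.
ratio = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)


def iqr(data):
    """Unweighted interquartile range based on statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1


clean = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]    # hypothetical relative price changes
shocked = clean[:-1] + [20.0]                     # same data with one large price shock

print(f"IQR/s for the normal distribution: {ratio:.3f}")
print(f"s:   {stdev(clean):.2f} -> {stdev(shocked):.2f}")
print(f"IQR: {iqr(clean):.2f} -> {iqr(shocked):.2f}")
```

In this example the IQR is unchanged by the shock, while the standard deviation increases several-fold because the outlier's deviation from the mean is squared.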



5. Classical and Robust Measures of Skewness 

Skewness reflects the shape of the distribution. A symmetric distribution has zero 
skewness, a distribution which is asymmetric with the largest tail to the right 
has a positive skewness, and a distribution with a longer left tail has a negative 
skewness. 

5.1. Classical Measure of Skewness

Normally, the standardized, unitless measures of asymmetry (skewness) are used. 
The best known measure of skewness is the classical skewness (Roger, 2000) 



skew(class) 



( 5 . 1 ) 



However, we will consider unstandardized measures of asymmetry as well. 
In this way skewness becomes dependent on scale, but this correlation does not 
bother us. Indeed, the economic models we rely on suggest that scale and skewness 
interact. In Ball and Mankiw (1995) it is shown that scale magnifies the effect of 
skewness on inflation, while the asymmetric range of inaction in Ball and Mankiw
(1994), which is a possible source of right skewness in the data, is also unscaled. All
unstandardized measures of asymmetry are expressed in some units, and the robust
measures have the same units as the data and the measures of location and scale.
The unstandardized classical skewness is defined as dskew(class) = s^3 skew(class).



5.2. Robust Measures of Skewness 

A disadvantage of the classical measures of skewness is that they change dramat- 
ically when we introduce an outlier. This is clearly shown by the empirical study 
of Brys et al. (2003). On the other hand, skewness is supposed to measure the 
asymmetry of the observations, so we may not downweight outliers too heavily. 
The first robust alternative we propose is dskew(125), which is given by: 

    dskew(125) = (Q_{0.875} - Q_{0.50}) - (Q_{0.50} - Q_{0.125})        (5.2)

with Q_p again the p% (weighted) percentile. The following measure is very similar,
but it is based on other percentiles. We compute dskew(250) as:

    dskew(250) = (Q_{0.75} - Q_{0.50}) - (Q_{0.50} - Q_{0.25})        (5.3)

Note that (5.2) and (5.3) are unstandardized versions of the class of skewness 
measures introduced by Hinkley (1975). 

Another measure of skewness can be obtained by standardizing the data as
z_i = x_i - Q_{0.50} and then taking the first moment of these standardized
observations, which corresponds to the mean-median difference, yielding meme = \bar{z}. This
measure is not robust, because it uses the mean. However, we consider it here as an
alternative measure of skewness, as it has, compared to the classical measures, the
advantage that the influence of what happens in the tails is far less accentuated.
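
The three alternative measures can be sketched as follows. The code assumes the weighted percentile rule of Section 3.1 and weights that sum to one; the function names and the sample cross-section are hypothetical and serve only to show how the quantities are combined.

```python
def weighted_percentile(values, weights, p):
    """First sorted value whose cumulative weight share exceeds p (0 < p < 1)."""
    total = sum(weights)
    cumulative = 0.0
    pairs = sorted(zip(values, weights))
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total > p:
            return value
    return pairs[-1][0]


def dskew(values, weights, p):
    """Unstandardized quantile skewness: (Q_{1-p} - Q_0.50) - (Q_0.50 - Q_p)."""
    lower = weighted_percentile(values, weights, p)
    median = weighted_percentile(values, weights, 0.50)
    upper = weighted_percentile(values, weights, 1.0 - p)
    return (upper - median) - (median - lower)


def meme(values, weights):
    """Mean-median difference, assuming the weights sum to one."""
    mean = sum(w * x for x, w in zip(values, weights))
    return mean - weighted_percentile(values, weights, 0.50)


# Hypothetical cross-section of price changes (in %) and expenditure weights.
changes = [-0.3, 0.1, 0.2, 0.4, 2.5]
weights = [0.10, 0.30, 0.30, 0.25, 0.05]

print(dskew(changes, weights, 0.125))   # dskew(125)
print(dskew(changes, weights, 0.25))    # dskew(250)
print(meme(changes, weights))           # mean-median difference
```

All three values are positive for this right-skewed example. Only meme responds to the size of the largest observation, which reflects the remark that it is not robust.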



5.3. Chronic Right Skewness 

We have applied both types of measures of skewness to the cross-sections of the
Belgian inflation data. The resulting time series are plotted in Figure 3. The
classical measures of skewness show a substantial degree of frequently switching left
and right skewness, which is a typical result in the literature on the analysis
of inflation data. Indeed, this property has been documented for several other
countries (see Aucremanne et al., 2002, for references). However, they do not show
a clear tendency towards positive values. As for scale, the time series properties of
the robust measures of skewness are fundamentally different from those of the
classical ones. As the ADF tests in Table 1 indicate, the null hypothesis of a unit root
in the classical measures was rejected at the 99% significance level. Moreover, the 
constant term in the test equation was not different from zero at the conventional 
significance levels for skew(class). For dskew(class) the constant was significant at 
the 95% level, but not at the 99% level. Also, in this case the estimated value 
of the constant is small relative to the observed variability. Hence, the classical 
measures hardly reveal any form of chronic right skewness in a significant way, 
notwithstanding the fact that we provided evidence of a tendency towards right 
skewness. 

In contrast, the robust measures dskew(125) and dskew(250) show a more 
pronounced tendency towards positive values. Hence, they better reveal the chronic 
right skewness of the data. The mean-median difference meme also shows such a 
tendency, although it takes negative values more frequently than the two other 
robust measures. We owe the latter to the fact that meme is in fact not robust. 
The null hypothesis of a unit root is rejected for dskew(250) at the 95% signifi- 
cance level, suggesting mean reversion around a significant positive constant. In 
contrast, for dskew(125) and meme, the null of a unit root is not rejected,
suggesting a time-varying degree of chronic right skewness. Particularly in the second
half of the sample the tendency towards right skewness seems less pronounced, 
thus confirming the findings of Sections 3 and 4. A chronic tendency towards right 
skewness is also found in several other countries and some of these studies also 
mention that this tendency is less pronounced in the most recent period, when 
inflation is lower (see Aucremanne et al., 2002, for references).

Figure 3. Time series of the skewness measures.



6. Inflation Affecting Relative Price Changes

In view of the chronic right skewness of the distribution of Belgian consumer price 
changes revealed in the previous section by the robust measures, it is important 
to stress that two alternative economic models can generate this phenomenon. 
The two models have however different, empirically testable implications for the 
relation between skewness and aggregate inflation. 

The first model relies on an exogenous form of downward nominal rigidity 
of prices of the Tobin type (Tobin, 1972). This model predicts a negative relation
between inflation and skewness: higher aggregate inflation leads to less right skew- 
ness (see Yates, 1998, on this issue). The second model relies on menu costs, i.e. 
it assumes that changing prices involves a (menu) cost, which is fixed in the sense 
that it does not depend on the magnitude of the price change. Examples of this 
type of costs are changes to price lists, catalogues, advertising, etc. This menu cost 
produces a range of inaction: firms react only to large price shocks, for which the
advantage of changing prices outweighs the menu cost. In Ball and Mankiw (1994) 
it is shown that this range of inaction becomes asymmetric in the presence of 
trend inflation. In other words, (positive) trend inflation reduces the likelihood of 
observing a downward adjustment of prices relative to the likelihood of an upward 
adjustment. Therefore, this model implies a positive relation between inflation and 
skewness: higher aggregate inflation leads to more right skewness. 

In view of this, the relation between aggregate inflation and skewness is stud- 
ied by cointegrating the robust measures of skewness with actual inflation. As 
is shown in Table 2, using the robust measure dskew(125) reveals a clear posi- 
tive cointegration relationship between asymmetry and inflation. Each percentage 
point of additional (monthly) inflation tends to increase dskew(125) by 1.5 per- 
centage points. Moreover, the constant term in the cointegration relation is very 
small both in statistical and in economic terms, suggesting that the chronic right 
skewness would disappear completely at zero inflation. As such, the long term 
behavior of the chronic right skewness measured by dskew(125) seems to be an 
integral part of the inflationary process itself. A similar conclusion was obtained 
on the basis of the cointegration relation between the median and the mean. 

Also for dskew(250), the Johansen cointegration test reveals the existence 
of a positive cointegration relation with inflation, notwithstanding the fact that 
the ADF-test reported in Table 1 suggests that dskew(250) is stationary. In such
a conflicting case the Johansen test dominates the ADF-test. This measure of
asymmetry tends to increase by 0.26 percentage points for each percentage point of (monthly)
inflation. The constant in the cointegration relation for dskew(250) suggests the 
existence of some form of chronic right skewness which is independent of the infla- 
tionary process. The impact of this constant is, however, relatively small compared 
to the impact of inflation. Indeed, each 0.2% of additional monthly inflation (ap- 
proximately 2.4% on a yearly basis) increases dskew(250) by the same amount 
as that resulting from the constant. This suggests that a substantial part of the 
chronic asymmetry measured by dskew(250) disappears at zero inflation. 
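
The arithmetic behind the 0.2% figure can be made explicit. The sketch below reads the normalized cointegration equation for dskew(250) in Table 2 as dskew(250) = 0.0533 + 0.2579 x inflation; this sign convention is an assumption, chosen because it matches the positive relation and the positive constant described in the text.

```python
# Coefficients taken from Table 2 (dskew(250) column), signs read as described above.
constant = 0.0533          # chronic right skewness independent of inflation
slope = 0.2579             # effect of one percentage point of monthly inflation

# Monthly inflation whose effect on dskew(250) equals the constant.
break_even_monthly = constant / slope
# Simple (non-compounded) annualization.
break_even_yearly = 12 * break_even_monthly

print(f"{break_even_monthly:.3f}% per month, about {break_even_yearly:.1f}% per year")
```

The break-even rate is close to 0.2% per month, so at the inflation rates observed in the sample the inflation-driven part of dskew(250) quickly outweighs the constant.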




These findings thus tend to provide evidence in favor of menu cost models, 
while they strongly contradict the exogenously assumed downward nominal rigidity 
of Tobin (1972). Distinguishing between these two models is important from a 
monetary policy perspective, as they have different implications for the optimal 
rate of inflation. The menu cost model favors a zero inflation rate in order to minimize
the disturbing impact of aggregate inflation on the distribution of relative price
changes, whereas the Tobin model argues in favor of a positive inflation rate
that is sufficiently high to overcome the inefficiencies resulting from the downward 
rigidity. 

A similar analysis using the classical measures of skewness did not allow us
to describe this long-run behavior of the chronic right skewness. Although with
skew(class) the Johansen test suggests the existence of a cointegration relation, we 
do not interpret this as meaningful because of its coefficients which are estimated 
very imprecisely (large standard errors for both the coefficient of the mean and 
the constant). As a matter of fact, in this case the Johansen test exactly confirms 
the findings of the ADF-test as skew(class) is cointegrated with the constant zero 
(or, it is stationary around a non-significant constant). Hence this measure does 
not reveal any form of chronic right skewness in a significant way, nor a systematic 
long-run impact of the inflation rate on the asymmetry. A similar conclusion can be 
made with dskew(class), although the constant term is estimated somewhat more 
precisely. Both tests come to a similar conclusion, namely a stationary behavior 
around a positive constant which is (according to the ADF-test) statistically sig- 
nificant at the 95% level, but not at the 99% level. Nevertheless, this constant is 
small compared to the variability observed in the series of dskew(class), such that 
we may conclude that dskew(class) hardly reveals any chronic right skewness. 



7. Conclusions 

In this paper we have studied the properties of the distribution of Belgian con- 
sumer price changes, by using both classical and robust measures of location, scale, 
skewness and tail weight. The robust measures of tail weight showed clearly (i) 
the fat-tailed nature of the distribution, (ii) the short run volatility of the tail 
weights, (iii) the tendency for the right-hand tail to dominate the left-hand tail, 
which points towards chronic right skewness and (iv) the increase over time of the 
left-hand tail and the more stationary behavior of the right-hand tail. 

It was found that the chronic right skewness was cointegrated with aggregate 
inflation. Moreover, we found little evidence of chronic right skewness not depen- 
dent on the inflationary process, suggesting that the asymmetry would disappear 
at zero inflation. Overall, these results are in line with the predictions of menu cost 
models and they are symptomatic of nominal rigidities, in the sense that prices 
are adjusted infrequently. However, they do not point in the direction of specific 
downward rigidities, other than those endogenously generated by the prevailing
inflation rate. These findings provide arguments in favor of a price stability-oriented
monetary policy. 



Robust Strategies for Quantitative
Investment Management 

G. Bassett, G. Gerber and P. Rocco 



Abstract. We show how quantile estimation combined with robust methods 
can be used in quantitative investment management. A portfolio manager 
uses a quantitative model to select securities. The objective is to outperform 
a benchmark portfolio, subject to risk constraints. Traditional stock selection 
models express expected returns as a function of factors where all parts of the 
return distribution are affected similarly. This is subsumed by the quantile 
approach in which a stock’s entire return distribution is a conditional function 
of factors. Robust methods ensure that our estimates do not depend on a small
subset of the data. Regression quantile estimates then detect the potentially 
different impact of factors at the center and tails of the return distribution. 
This is illustrated in assessing a model’s forecasting accuracy while controlling 
for return differences between economic sectors. We are thereby able to detect
forecasting properties that would have been missed by a conventional analysis 
of the data. 

Mathematics Subject Classification (2000). Primary 91B28; Secondary 62G35. 

Keywords. Quantitative portfolio management, robust models, quantile re- 
gression. 



1. Introduction 

We illustrate how the combination of robust and quantile-based methods can im- 
prove quantitative investment management strategies. A portfolio manager uses 
a quantitative model to select securities in the context of a systematic and dis- 
ciplined approach to asset valuation and investing. The objective is to build and 
then periodically rebalance a portfolio so that, subject to risk constraints, it out- 
performs a benchmark portfolio. The risk constraints limit expected deviations 
between the portfolio and benchmark returns. Asset selection is based on a mul- 
tivariate model that exploits “anomalies” in which stocks’ risk-adjusted returns 
above a benchmark (residual returns) are associated with various characteristics 
of particular stocks. These characteristics (e.g., P/E ratios and earnings growth
rates) are termed “factors”. Among the best-known anomalies are those associated
with value and small capitalization stocks, and market reaction to revised earnings 
estimates. (The literature in finance on such anomalies is quite vast. Extended dis- 
cussions are provided in Fama and French, 1992, DeBondt and Thaler, 1985, and 
Lakonishok et al., 1994.) 

Quantitative methods in investment management utilize very large data sets. 
Historical databases with daily returns and firm characteristics for thousands of 
companies going back many years are used to develop and backtest models. Up- 
dated data is used for diagnostics, factor attribution, and periodic revision of 
portfolio weights. The large amount of data means it is difficult and costly to 
eliminate mistakes or detect correct, but influential, observations. Mistakes arise 
because of transcription errors, changed ticker symbols, unrecorded stock splits, 
and so on. Extreme observations occur for the stocks of the formerly exuberant 
New Economy as well as more recently for companies using the latest accounting 
“innovations”.

The consequences of data outliers are well known: only a few discrepant ob- 
servations have a huge impact on standard estimation methods. The methods are 
tricked because they are extremely sensitive to only a few deviant observations 
- they are not robust. Robust data analysis produces estimates that are deter- 
mined by the bulk of the data without undue influence by a few observations; 
see Rousseeuw (1984) and Koenker (1982). Knowing that estimates are based on 
the bulk of the data is advantageous for model development. It signals important 
determinants of returns that otherwise would be obscured by a few discrepant ob- 
servations. Conversely, factors deemed important because of only a few influential 
observations are revealed by robust estimates as lacking explanatory power for the 
bulk of the data. Hence, robustness is important for investment management be- 
cause of the existence of outliers in very large databases together with the lack of 
robustness of standard methods. The importance of robustness is magnified by the 
fact that quantitative models are often used to select a relatively limited number 
of stocks; mistakes can be costly. 

Investment valuation models are also characterized by very low signal-to-noise 
ratios. Finding factors that correctly signal even slight departures, on average, from 
market returns can be quite profitable even when departures from the average 
are large. The signal provides the basis for a portfolio manager’s ability to earn 
excess returns while the noise means there can be large deviations between actual 
and expected returns. Some of the deviation can be diversified away by exploiting 
independence and negative correlations. The sheer magnitude of the noise however 
means that portfolios with a large number of stocks still have high model risk. 
Further, signal strength varies over time thus requiring frequent rebalancing and 
hence large transaction costs for large portfolios. As a result of high noise, portfolio 
returns are subject to potentially large departures from what is predicted by the 
signal. This leads to consideration of how factors affect the tails of the return 
distribution. 




The traditional framework for the robust approach has been a data generating 
process with a few gross errors from a “true” primary model. The objective is to 
estimate the central tendency of the main model, without undue influence by a 
few outlying, discrepant, observations. In our case the primary model specifies the 
relation between expected returns and other variables. Features of the distribution 
at places other than the expected value, however, are of critical importance. Rather 
than “mistakes” the deviations are (or, at least can be) functions of the same 
factors that determine central tendency. Hence, we want to allow factors that 
determine central tendency to also affect the tails of the return distribution. Factors 
that may have no influence on expected returns cannot be ruled out as important 
in forecasting other parts of the return distribution. Alternatively, factors that do 
influence the center can have different effects in the tails. 

This leads to our data analysis that combines robust and quantile meth- 
ods. For an overview of quantile modelling, see Koenker and Hallock (2001). The 
combination is useful for quantitative investment because there are influential ob- 
servations in our large data sets, and because there are quantile effects in which 
models work differently in the tails and in the central region of the return distri- 
bution. The quantile approach therefore provides information about how factors 
affect the entire return distribution, not just central tendency. The standard model 
in which quantile effects are constant is nested in the quantile model. While the 
traditional approach has factors affecting the mean and other parts of the return 
distribution equally, the more comprehensive quantile approach allows differences 
to be detected. 
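
The building block of the quantile approach is the check function of Koenker and Bassett (1978): the value that minimizes the summed check loss over a sample is a sample quantile. The sketch below shows this for a constant only. Regression quantiles replace the constant by a linear function of the factors, which is not implemented here. The function names and the data are illustrative assumptions.

```python
def check_loss(u, tau):
    """Check function: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_quantile(data, tau):
    """tau-th sample quantile as the data point minimizing the summed check loss."""
    return min(data, key=lambda q: sum(check_loss(x - q, tau) for x in data))


# Hypothetical residual returns (in %); integers keep the example easy to verify.
returns = list(range(11))               # 0, 1, ..., 10

print(sample_quantile(returns, 0.5))    # center of the distribution
print(sample_quantile(returns, 0.9))    # upper tail
print(sample_quantile(returns, 0.1))    # lower tail
```

Searching over the data points is sufficient here because a minimizer of the summed check loss can always be found among the observations. Dedicated linear programming routines are used in practice for the regression case.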

We have used this approach in the development, assessment, and calibration 
of investment strategies. In this paper we focus on how the quantile-based, robust 
approach can be used to assess potential modifications of a stock selection model. 
Section 2 explains how the stock selection model fits into the overall investment 
process. The selection model itself is estimated using robust methods. Section 3 
presents an assessment of predictive ability based on expected values, the usual 
criterion in the investment management field. It also indicates our enhancement in 
which the entire conditional return distribution is considered. Section 4 presents 
regression quantile estimates of the relation between actual and predicted returns 
where we control for sector effects. Section 5 discusses results and indicates further 
enhancements using the robust-quantile approach. 



2. Overview of the Investment Process 

The investment process can be divided into four interrelated stages. In the first 
stage, a portfolio manager identifies an investment universe from which individual 
stocks will be ranked and evaluated for inclusion in a portfolio. The investment uni- 
verse is typically designed to client specifications; it might consist of constituents 
in a broad-market index or stocks in a specific range of market values. The universe 
for our illustration consists of the largest 1000 stocks (by market value) in the U.S. 

In the next stage, the manager identifies a benchmark portfolio of stocks
(often an index). In the investment management industry the benchmark plays 
two roles. It sets the target rate of return that a portfolio seeks to exceed. For our 
illustration the benchmark is the S&P 500 so that the manager seeks to maximize 
returns relative to the S&P 500 index. 

The benchmark also serves as the reference point for defining the extent to 
which a portfolio can differ from the benchmark. A client will typically specify a 
constraint that limits deviations between portfolio (P_t) and benchmark (B_t)
returns during period t. This constraint, known as “tracking error”, is defined as (an
estimate of) the standard deviation of P_t - B_t. It is denoted by TE(P_t, B_t) and is
required to stay below a prespecified value. Tracking error is typically estimated
using industry-standard risk models supplied by vendors. The benchmark therefore
sets the target rate against which performance is judged, and constrains the
expected difference between P_t and B_t.

Let P_t denote portfolio returns for period t, where

    P_t = \sum_{i \in U} w_{i,t} R_{i,t}

with U denoting the investment universe, w_{i,t} the weight on stock i at time t, and R_{i,t}
the return to stock i during period t. The manager’s objective is to select weights
w_{i,t} to maximize the difference between P_t and B_t subject to TE(P_t, B_t) < x.
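
The two quantities in this stage can be sketched as follows. The portfolio return is the weighted sum of stock returns over the universe. For tracking error, the sketch uses the sample standard deviation of historical active returns, which is a simplification of the vendor risk models mentioned above. Tickers, weights and returns are hypothetical.

```python
from statistics import stdev


def portfolio_return(weights, returns):
    """Portfolio return: sum over stocks i in the universe of w_{i,t} * R_{i,t}."""
    return sum(weights[stock] * returns[stock] for stock in weights)


def tracking_error(portfolio_returns, benchmark_returns):
    """Sample standard deviation of the active return (portfolio minus benchmark)."""
    active = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    return stdev(active)


# One period: hypothetical weights and returns for a three-stock universe.
weights_t = {"AAA": 0.5, "BBB": 0.3, "CCC": 0.2}
returns_t = {"AAA": 0.02, "BBB": -0.01, "CCC": 0.03}
p_t = portfolio_return(weights_t, returns_t)

# Several periods: hypothetical portfolio and benchmark return histories.
p_history = [0.013, 0.004, -0.002, 0.010]
b_history = [0.010, 0.005, -0.004, 0.008]
te = tracking_error(p_history, b_history)

max_te = 0.005                           # hypothetical limit x set by the client
print(p_t, te, te < max_te)
```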

The third stage in the process makes use of a selection model to determine 
the relative weight of stocks in the managed portfolio. Our analysis is based on a 
multi-factor model that assigns a so-called “alpha”-score, α_{i,t+1}, to each stock i,
which serves as a forecast of a stock’s relative residual return in month t + 1. A
linear model for the score α_{i,t} may be represented as

    α_{i,t+1} = b_{1,t+1} F_{1,i,t} + b_{2,t+1} F_{2,i,t} + ... + b_{k,t+1} F_{k,i,t}        (2.1)

where F_{j,i,t} is the exposure of factor j for stock i at the end of period t, and b_{j,t+1}
is the “weight” on factor j during period t + 1. Typical factors might include well-known
financial ratios (e.g., E/P), measures of earnings growth and dispersion, and
past price movements. The model “weights”, b_{j,t+1}, are computed using historical
data on factors and returns. Robust and quantile methods are used at this stage of
the investment process to control for outliers, and assess quantile effects in the way 
factors influence returns; estimators include least median of squares, least absolute 
deviations, and regression quantiles; see Rousseeuw (1984), Bassett and Koenker 
(1978), and Koenker and Bassett (1978). 

The weights, $b_{j,t+1}$, are estimated based on data known at time $t$ and then
updated to reflect the success of the associated factors in forecasting returns. The
$\alpha$-values are generally in the range $-2$ to $2$. Note that the range of $\alpha$-values is
limited because extreme factor values are trimmed or Winsorized; see Grinold and
Kahn (2000, p. 382). For our assessment, we have a sequence of $\alpha_{i,t}$ for $i = 1$ to 1000
and $t$ each month from January 1991 to September 2002. The goal is to assess the
ability of the model to discriminate between the stocks in the investment universe
with respect to future returns.

The final stage of the process consists of constructing a portfolio by maximizing
the “alpha” of the portfolio subject to the risk constraint associated with the
benchmark portfolio and turnover considerations. The process of buying and selling
stocks to adjust the holdings in the managed portfolio entails costs. As such,
there is a trade-off between capturing the signals from the valuation model and
paying the turnover costs. Note that this process reflects a setting in which only
purchases of stocks are permitted. The issues involved when purchases and sales of
borrowed stocks (shorts) are permitted are more complex (see Gerber, 1996).

Figure 1. Average monthly residual returns grouped by $\alpha$ decile
(Decile 10 is best).



3. Assessing Model Performance 

3.1. Conditional Expectation 

As indicated, our analysis is based on $\alpha$’s generated for each of approximately 1000
stocks for each month from January 1991 to September 2002.

The weights on each factor, $b_j$, vary over time depending on the recent performance
of the factor. The method used to generate factor weights is similar to
that employed in other stock selection models we have investigated. Backtests of
models are typically based on the ability of $\alpha$ to forecast one-month ahead returns.
For example, Figure 1 shows the average residual monthly return associated
with portfolios created by sorting stocks into deciles based on $\alpha$-score. The graph
shows that stocks in the top $\alpha$ decile had an average monthly residual return of 1.1%
while the stocks in the lowest decile had returns of $-0.7\%$ over the entire period.

Gross returns (exclusive of transaction costs) from a strategy of buying the
stocks in decile 10 and shorting those in decile 1 would have produced an average
monthly gain of 1.8%. The nice “step-ladder” pattern in the graph indicates that the
valuation model is able to discriminate among stocks based on future expected
returns. While the $\alpha$’s are suggestive of a profitable portfolio strategy, further
analysis would need to consider transaction costs. Further, the returns pictured
in Figure 1 do not account for the tracking error. The positive returns associated
with $\alpha$, for example, might be concentrated in only a few sectors and hence entail
large tracking error. If the highest $\alpha$’s tended to concentrate in, say, Technology,
then a portfolio tilted toward large $\alpha$ would be heavily exposed to Technology and
subject to large deviations from the overall market.
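
The decile backtest described in this subsection can be sketched in a few lines. This is only an illustration and not the authors' procedure: the function name `decile_means`, the equal-count grouping rule, and the toy data are assumptions introduced here. The last line restates the arithmetic behind the long-short figure quoted in the text (1.1% minus $-0.7\%$).

```python
def decile_means(alphas, returns, groups=10):
    """Average return within each alpha group (group 0 = lowest alpha).

    Assumes there are at least `groups` observations.
    """
    order = sorted(range(len(alphas)), key=lambda i: alphas[i])
    size = len(order) // groups
    means = []
    for g in range(groups):
        start = g * size
        # The last group absorbs any remainder.
        end = (g + 1) * size if g < groups - 1 else len(order)
        members = order[start:end]
        means.append(sum(returns[i] for i in members) / len(members))
    return means

# Toy data (hypothetical): returns rise linearly with alpha.
alphas = [i / 10.0 - 1.0 for i in range(20)]
returns = [0.01 * a for a in alphas]
means = decile_means(alphas, returns)

# Long-short spread implied by the figures quoted in the text, in percent.
spread = 1.1 - (-0.7)
```

A monotone sequence in `means` corresponds to the “step-ladder” pattern the text describes for Figure 1.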

3.2. Conditional Distribution 

A more complete picture of model performance that is motivated by our concern
about the entire return distribution is depicted in Figure 2. It shows residual returns
at each $\alpha$ decile. The average return in the previous figure is indicated by
E(). The picture shows that returns tend to increase with $\alpha$ decile, but it also shows a
very wide distribution around the central tendency. Further, the conditional distributions
at each $\alpha$ decile are not just location shifts of one another. For example,
we see that the Q(.9) part of the distribution tends to increase with $\alpha$, but at
low $\alpha$ deciles the function is actually decreasing. This means that the top end
of the return distribution decreases in a move from $\alpha$ decile 1 to 2, contrary to
what occurs in the center of the distribution. In a similar fashion, Q(.1) decreases
at high $\alpha$. Shifting a portfolio into higher $\alpha$ stocks increases average returns, but
decreases returns in the lower tail.



4. Estimating Conditional Quantile Effects 

In this section we present regression quantile estimates of the relation between $\alpha$
and one-month ahead actual excess returns. Our model controls for sector effects
with sector dummy variables corresponding to the 15 economic sectors in the 1000
stock universe. The model for returns is

$$Q(\theta \mid s, \alpha) = \alpha_s(\theta) S + \beta(\theta) \alpha \qquad (4.1)$$

where $Q(\theta \mid s, \alpha)$ is the $\theta$th quantile of the return distribution conditional on the
stock’s sector and $\alpha$ score, $S$ is an indicator for the stock’s sector, $\alpha_s$ is the sector
coefficient, $\alpha$ is the stock’s “alpha”, and $\beta(\theta)$ is the regression quantile coefficient.

The model is estimated separately for $\alpha > 0$ and $\alpha < 0$. This provides
separate $\beta(\theta)$ estimates for stocks predicted to outperform the market ($\alpha > 0$)
and those predicted to underperform the market ($\alpha < 0$).

Graphical versions of the estimates are presented in Figures 3 and 4. Quantile
regression coefficients as a function of the $\theta$th quantile for $\alpha > 0$ (Figure 3) and
$\alpha < 0$ (Figure 4) are shown. The horizontal lines in Figure 3 depict the least
squares estimate and confidence bounds. It shows a positive and constant relation
between predicted and actual excess returns. The estimates indicate that for $\alpha > 0$
expected excess returns are an increasing function of $\alpha$, after controlling for sector
effects. The figure also shows, however, that there are quantile effects: after controlling
for sector effects the impact of $\alpha$ on excess returns is not the same throughout
the return distribution. Compared to the effect at the mean, the impact of $\alpha$ on
returns in the upper tail is even more powerful. At the lower quantiles the effect
diminishes and at the lowest part of the distribution actually becomes negative. A
negative coefficient means that as $\alpha$ increases the associated quantile of the return
distribution decreases, contrary to how we would want the model to perform.

Figure 2. Quantile residual returns by $\alpha$ decile.

The situation for $\alpha < 0$ is shown in Figure 4. The coefficient at the expected
value is positive but smaller than for $\alpha > 0$. The expected average performance
of the model is good for all $\alpha$, but better for the stocks that are predicted to
outperform the market ($\alpha > 0$). But there are again quantile effects; high quantile
parts of the distribution show a negative association between alpha and returns.

Figure 5 presents the same results, but in “data space”. It shows returns as a
function of $\alpha$, controlling for the separate sector intercepts included in the model.
There is a positive relation between expected and forecast returns. There is still a
low signal-to-noise ratio, which is evidenced by the wide range of conditional quantiles.
Since we are controlling for sectors, the low signal-to-noise ratio evident in Figure 2
is not due to return differences between sectors. Figure 5 also shows the poor
performance of $\alpha$ in the lower tails for $\alpha > 0$, and in the upper tails for $\alpha < 0$. The
results raise the question of whether this poor performance might be improved by
including additional explanatory variables in our quantile specification, a topic to
be explored in further research.

Figure 3. Regression quantile and least squares, E(), estimates
for residual returns for $\alpha > 0$, controlling for sector effects.



5. Discussion 

In spite of their advantages, robust methods have not been widely used in quantitative
finance. As in other areas of applied statistics, the absence of robust methods
has been due in part to the failure to recognize the extreme sensitivity of standard
methods, or the belief that standard methods were already “optimal” or “best”.
The optimality label, however, is based on assumptions that are only more or less
appropriate for financial data. Indeed, some of the earliest questioning of standard
assumptions came from analysis of financial data that showed too many outlying
realizations relative to the standard normal distribution; see Mandelbrot (1965)
and Fama (1963). The fact that data tended to have fatter tails relative to the
center of the distribution meant not only that standard methods were suboptimal,
but that they could be highly misleading relative to robust methods.

Figure 4. Regression quantile and least squares, E(), estimates
for residual returns for $\alpha < 0$, controlling for sector effects.



Another reason for the absence of robust methods is a perception that preprocessing
data to remove univariate outliers results in robust estimates. Univariate
screening and trimming of data help identify influential values of dependent and
explanatory variables, but they do not assure robust estimates. In the multivariate
setting, even with no outliers in any single variable, there can be large differences
between standard and robust estimates. Trimming extreme observations is motivated
by robustness considerations, but it does not assure estimates that are determined
by the bulk of the data.

Statistical applications of quantile estimation in finance have also not been
widely used (but for a recent application, see Bassett and Chen, 1991). Data
analysis has tended to focus on estimating central tendency with only secondary
attention to determinants of tail events, or models were restricted to simple location
shifts as a function of explanatory variables. The low signal-to-noise ratio inherent
in financial models, together with increasing concern about risk, makes quantile
methods particularly appealing in quantitative portfolio management. The quantile
model allows the relation between a set of factors and tail events to be estimated
directly. This allows the relation between predicted and actual returns to
be assessed comprehensively. The application presented here has shown that the
$\alpha$-scores work well for expected returns, but result in increased dispersion at extreme
$\alpha$ values, a feature that would have been missed in a conventional analysis
focusing only on conditional expectations.

Figure 5. Regression quantile predictions as a function of $\alpha$ distribution,
controlling for sector effects.


An Adaptive Algorithm for
Quantile Regression

C. Chen 

Abstract. In this article, we introduce an algorithm to compute the regression 
quantile functions. This algorithm combines three algorithms - the simplex, 
interior point, and smoothing algorithm. The simplex and interior point al- 
gorithms come from the background of linear programming (Portnoy and 
Koenker, 1997). While the simplex method can handle small to middle sized 
data sets, the interior point method can handle large to huge data sets efficiently.
The smoothing algorithm is specially designed for the $L_1$ or quantile
regression type of problems (Chen, 2002), and it outperforms the other two 
algorithms for fat data sets. Combining these three algorithms produces an 
algorithm, which is adaptive in the sense that it can intelligently detect the 
input data sets and select one of the three algorithms to efficiently compute 
the regression quantile functions. 

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62J99. 

Keywords. Quantile regression, heteroscedasticity, finite smoothing algorithm, 
simplex, interior point, median regression, linear programming. 



1. Introduction 

Quantile regression, which was introduced by Koenker and Bassett (1978), estimates
and conducts inference about conditional quantile functions. A real-valued
random variable $Y$ may be characterized by its distribution function,

$$F(y) = \mathrm{Prob}(Y \le y), \qquad (1.1)$$

while for any $0 < \tau < 1$,

$$Q(\tau) = \inf\{y : F(y) \ge \tau\} \qquad (1.2)$$

is called the $\tau$th quantile of $F$. The median, $Q(1/2)$, plays the central role. Like the
distribution function, the quantile function provides a complete characterization
of the random variable $Y$.

For a random sample $\{y_1, \ldots, y_n\}$ of $Y$, it is well known that the sample
median is the solution of the optimization problem

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} |y_i - \xi|. \qquad (1.3)$$

The general $\tau$th sample quantile $\hat{\xi}(\tau)$, which is an analogue of $Q(\tau)$, may be
formulated as the solution of the optimization problem

$$\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} \rho_\tau(y_i - \xi), \qquad (1.4)$$

where $\rho_\tau(z) = z(\tau - I(z < 0))$, $0 < \tau < 1$.

Just as estimating the unconditional mean, viewed as the minimizer

$$\hat{\mu} = \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} (y_i - \mu)^2, \qquad (1.5)$$

can be extended to estimation of the linear conditional mean function
$E(Y \mid X = x) = x'\beta$ by solving

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - x_i'\beta)^2, \qquad (1.6)$$

the linear conditional quantile function, $Q_Y(\tau \mid X = x) = x'\beta(\tau)$, can be estimated
by solving

$$\hat{\beta}(\tau) = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta). \qquad (1.7)$$

The median case, $\tau = 1/2$, which is equivalent to minimizing the sum of absolute
values, is usually known as $L_1$ regression.
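
The characterization of a sample quantile as a minimizer of the check loss $\rho_\tau(z) = z(\tau - I(z < 0))$ can be verified numerically. The sketch below is an illustration only: it relies on the fact that the objective is piecewise linear and convex in $\xi$, so a minimizer is attained at one of the observations, and it simply searches the data points. The function names and the data are hypothetical.

```python
def check_loss(z, tau):
    """rho_tau(z) = z * (tau - I(z < 0))."""
    return z * (tau - (1.0 if z < 0 else 0.0))

def sample_quantile(ys, tau):
    """Minimize sum_i rho_tau(y_i - xi) over xi by searching the data points.

    The objective is convex and piecewise linear with kinks at the
    observations, so one of the observations is always a minimizer.
    """
    def objective(xi):
        return sum(check_loss(y - xi, tau) for y in ys)
    return min(sorted(ys), key=objective)

data = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0]
median = sample_quantile(data, 0.5)   # middle order statistic
upper = sample_quantile(data, 0.9)    # largest observation for this sample
```

This brute-force search costs $O(n^2)$ and is meant only to make the definition concrete; the algorithms discussed in the following sections address the regression case efficiently.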

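An LP Problem

Let $\mu = [y - X\beta]_+$, $\nu = [X\beta - y]_+$, $\phi = [\beta]_+$, and $\varphi = [-\beta]_+$, where
$y = (y_1, \ldots, y_n)'$, $X = (x_1, \ldots, x_n)'$, and $[z]_+$ is the nonnegative part of $z$.

Let $D_{LAR}(\beta) = \sum_{i=1}^{n} |y_i - x_i'\beta|$ and $D_{\rho_\tau}(\beta) = \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta)$. For the $L_1$
problem, the simplex approach solves $\min_\beta D_{LAR}(\beta)$ by the reformulation

$$\min_{\beta} \{ e'\mu + e'\nu \mid y = X\beta + \mu - \nu, \ \{\mu, \nu\} \in \mathbb{R}^n_+ \} \qquad (1.8)$$

where $e$ denotes an $n$-vector of ones.

Let $B = [X \ \ {-X} \ \ I \ \ {-I}]$, $\theta = (\phi' \ \varphi' \ \mu' \ \nu')'$, and $d = (0' \ 0' \ e' \ e')'$, where
$0' = (0 \ 0 \ \ldots \ 0)_p$. The reformulation presents a standard LP problem:

$$(P) \qquad \min_{\theta} d'\theta \quad \text{subject to} \quad B\theta = y, \quad \theta \ge 0.$$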

This problem has the dual formulation

$$(D) \qquad \max_{z} y'z \quad \text{subject to} \quad B'z \le d.$$

It can be simplified as

$$\max_{z} \{ y'z \mid X'z = 0, \ z \in [-1, 1]^n \}.$$

By setting $\eta = \frac{1}{2}z + \frac{1}{2}e$ and $b = \frac{1}{2}X'e$, it becomes

$$\max_{\eta} \{ y'\eta \mid X'\eta = b, \ \eta \in [0, 1]^n \}. \qquad (1.9)$$

For quantile regression, $\min_\beta \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta)$, a similar processing presents the
dual formulation:

$$\max_{z} \{ y'z \mid X'z = (1 - \tau)X'e, \ z \in [0, 1]^n \}. \qquad (1.10)$$



2. Simplex 

Since the early 1950’s it has been recognized that median regression ($L_1$ regression)
can be formulated as a linear programming problem and efficiently solved with some
form of the simplex algorithm. An efficient version of the simplex algorithm was
developed by Barrodale and Roberts (1974). This algorithm solves the primal
LP problem (P) in two stages, which were developed according to the special
structure of the coefficient matrix $B$. The first stage only picks the columns in $X$
or $-X$ as pivotal columns. The second stage only interchanges the columns in $I$
or $-I$ as basic or nonbasic columns. The algorithm obtains an optimal solution by
executing these two stages interactively. Also, because of the special structure
of $B$, only the main data matrix $X$ is stored in the current memory.

This special version of the simplex algorithm for median regression can be
naturally extended to compute quantile regression for any given quantile,
even the whole quantile process (Koenker and d’Orey, 1993).

Although the worst case for this simplex algorithm is regarded as computationally
highly demanding for large data sets, careful and proper coding of the
algorithm still makes it suitable for data sets with fewer than 5000 observations and
50 variables on presently popular hardware (e.g., 1GHz CPU and 256MB of
RAM).



3. Interior Point 

To solve large to huge LP problems, alternative algorithms have been developed.
Rather than moving from vertex to vertex around the outer surface of the constraint
set as dictated by the simplex, the interior point approach of Karmarkar
(1984) solves a sequence of quadratic problems in which the relevant interior of
the constraint set is approximated by an ellipsoid. The worst-case performance
of the interior point algorithm has been shown to be better than that of the simplex
algorithm.

The excellent paper by Portnoy and Koenker (1997) revived enthusiasm
for applying $L_1$ regression or quantile regression to large or huge data sets by
introducing the faster interior point algorithm into this area.

There are many variations of interior point algorithms. The most widely
used algorithm for $L_1$ regression or quantile regression is the Primal-Dual with
Predictor-Corrector algorithm.

Let $c = y$, $b = (1 - \tau)X'e$, and $A = X'$. The dual problem (1.10) with a
general upper bound $u$ is

$$\max_z \{c'z\} \quad \text{subject to} \quad Az = b, \quad 0 \le z \le u.$$

To solve this LP problem, $0 \le z \le u$ is split into $z \ge 0$ and $z \le u$. Let $v$ be the primal
slack so that $z + v = u$, and associate dual variables $w$ with these constraints.

The interior point algorithm solves the system of equations to satisfy the
Karush-Kuhn-Tucker (KKT) conditions for optimality:

$$(KKT) \qquad \begin{aligned} Az &= b \\ z + v &= u \\ A't + s - w &= c \\ ZSe &= 0 \\ VWe &= 0 \\ z, s, v, w &\ge 0, \end{aligned}$$

where $Z = \mathrm{diag}(z)$ (that is, $Z_{ij} = z_i$ if $i = j$, $Z_{ij} = 0$ otherwise), $S = \mathrm{diag}(s)$,
$W = \mathrm{diag}(w)$, $V = \mathrm{diag}(v)$.

These are the conditions for feasibility, with the complementarity conditions
$ZSe = 0$ and $VWe = 0$ added. $c'z = b't - u'w$ must occur at the optimum.
Complementarity forces the optimal objectives of the primal and dual to be equal,

$$c'z_{opt} = b't_{opt} - u'w_{opt}.$$

The interior point algorithm works by iteratively using Newton’s method
to find a direction $(\Delta z^k, \Delta t^k, \Delta s^k, \Delta v^k, \Delta w^k)$ to move from the current solution
$(z^k, t^k, s^k, v^k, w^k)$ toward a better solution.

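To do this, two steps are used. The first step is called an affine step, which
solves a linear system using Newton’s method to find a direction
$(\Delta z^k_{\mathrm{aff}}, \Delta t^k_{\mathrm{aff}}, \Delta s^k_{\mathrm{aff}}, \Delta v^k_{\mathrm{aff}}, \Delta w^k_{\mathrm{aff}})$
to reduce the complementarity toward zero. The second step is
called a centering step, which solves another linear system to determine a centering
vector $(\Delta z^k_c, \Delta t^k_c, \Delta s^k_c, \Delta v^k_c, \Delta w^k_c)$ to further reduce the complementarity. The
centering step may not reduce the complementarity by much; however, it builds
up the central path and makes substantial progress toward the optimum in the next
iteration. With these two steps,

$$(\Delta z^k, \Delta t^k, \Delta s^k, \Delta v^k, \Delta w^k) = (\Delta z^k_{\mathrm{aff}}, \Delta t^k_{\mathrm{aff}}, \Delta s^k_{\mathrm{aff}}, \Delta v^k_{\mathrm{aff}}, \Delta w^k_{\mathrm{aff}}) + (\Delta z^k_c, \Delta t^k_c, \Delta s^k_c, \Delta v^k_c, \Delta w^k_c)$$

$$(z^{k+1}, t^{k+1}, s^{k+1}, v^{k+1}, w^{k+1}) = (z^k, t^k, s^k, v^k, w^k) + \alpha (\Delta z^k, \Delta t^k, \Delta s^k, \Delta v^k, \Delta w^k)$$

where $\alpha$ is the step length, assigned a value as large as possible but not so large
that a $z_i^{k+1}$ or $s_i^{k+1}$ is “too close” to zero.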

Although the Predictor-Corrector variant entails solving two linear systems
instead of one, fewer iterations are usually required to reach the optimum. The
additional overhead of calculating the second linear system is small, as the factorization
of the $(A\Theta^{-1}A')$ matrix has already been performed to solve the first
linear system. Refer to Wright (1996) for more details about this algorithm.



4. Smoothing 

Algorithms developed for a general LP problem may not fully exploit the properties
of the original $L_1$ or quantile regression problem and have their own shortcomings.
For example, the interior point algorithm can only give approximate solutions
of the original problem, and rounding has to be done if one requires the same
accuracy as that of the simplex algorithm. For some problems, this rounding step
requires significant extra computing time. In this case, some heuristic approaches
demonstrate advantages in both speed and accuracy. One of them is the
finite smoothing algorithm.

The finite smoothing algorithm was used by Clark and Osborne (1986) and Madsen
and Nielsen (1993) for $L_1$ regression. It can be naturally extended to compute
regression quantiles. In the following we briefly describe this algorithm; refer
to Chen (2002) for details.

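The non-differentiable function

$$D_{\rho_\tau}(\beta) = \sum_{i=1}^{n} \rho_\tau(y_i - x_i'\beta)$$

can be approximated by the smooth function

$$D_{\gamma,\tau}(\beta) = \sum_{i=1}^{n} H_{\gamma,\tau}(r_i(\beta)),$$

where $r_i(\beta) = y_i - x_i'\beta$ and

$$H_{\gamma,\tau}(t) = \begin{cases} t(\tau - 1) - \frac{1}{2}(\tau - 1)^2\gamma & \text{if } t \le (\tau - 1)\gamma \\ \dfrac{t^2}{2\gamma} & \text{if } (\tau - 1)\gamma \le t \le \tau\gamma \\ t\tau - \frac{1}{2}\tau^2\gamma & \text{if } t \ge \tau\gamma. \end{cases}$$

See Figure 1 for a plot of $H_{\gamma,\tau}$ and $\rho_\tau$.

Figure 1. Objective functions $H_{\gamma,\tau}$ and $\rho_\tau$.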
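
To make the approximation concrete, the following sketch compares the check loss with a Huber-type smoothed version. It assumes the piecewise form $H_{\gamma,\tau}(t) = t(\tau-1) - \frac{1}{2}(\tau-1)^2\gamma$ for $t \le (\tau-1)\gamma$, $t^2/(2\gamma)$ in the middle band, and $t\tau - \frac{1}{2}\tau^2\gamma$ for $t \ge \tau\gamma$, which is consistent with the gradient and Hessian stated later in this section. The function names are hypothetical.

```python
def check_loss(t, tau):
    """rho_tau(t) = t * (tau - I(t < 0))."""
    return t * (tau - (1.0 if t < 0 else 0.0))

def smooth_check_loss(t, tau, gamma):
    """Huber-type smoothing of rho_tau with threshold gamma > 0 (assumed form)."""
    lower, upper = (tau - 1.0) * gamma, tau * gamma
    if t <= lower:
        return t * (tau - 1.0) - 0.5 * (tau - 1.0) ** 2 * gamma
    if t >= upper:
        return t * tau - 0.5 * tau ** 2 * gamma
    return t * t / (2.0 * gamma)

def max_gap(tau, gamma, grid):
    """Largest value of rho_tau(t) - H(t) over a grid of t values."""
    return max(check_loss(t, tau) - smooth_check_loss(t, tau, gamma) for t in grid)

grid = [i / 100.0 for i in range(-300, 301)]
gap_coarse = max_gap(0.25, 1.0, grid)
gap_fine = max_gap(0.25, 0.01, grid)
```

Under this assumed form the smoothed function never exceeds $\rho_\tau$ and the gap is at most $\frac{1}{2}\gamma \max(\tau^2, (1-\tau)^2)$, so it vanishes as $\gamma$ decreases to zero, which is what the algorithm below exploits.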

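The function $H_{\gamma,\tau}$ is determined by whether $r_i(\beta) \le (\tau - 1)\gamma$, $r_i(\beta) \ge \tau\gamma$,
or $(\tau - 1)\gamma < r_i(\beta) < \tau\gamma$. These inequalities divide $\mathbb{R}^p$ into subregions separated
by the parallel hyperplanes $r_i(\beta) = (\tau - 1)\gamma$ and $r_i(\beta) = \tau\gamma$. The set of all such
hyperplanes is denoted by $B_\gamma$:

$$B_\gamma = \{\beta \in \mathbb{R}^p \mid \exists i : r_i(\beta) = (\tau - 1)\gamma \ \text{or} \ r_i(\beta) = \tau\gamma\}.$$

Define the sign vector $s_\gamma(\beta) = (s_1(\beta), \ldots, s_n(\beta))'$ by

$$s_i = s_i(\beta) = \begin{cases} -1 & \text{if } r_i(\beta) \le (\tau - 1)\gamma \\ 0 & \text{if } (\tau - 1)\gamma < r_i(\beta) < \tau\gamma \\ 1 & \text{if } r_i(\beta) \ge \tau\gamma \end{cases}$$

and introduce

$$w_i = w_i(\beta) = 1 - s_i^2(\beta).$$

Thus,

$$H_{\gamma,\tau}(r_i(\beta)) = \frac{1}{2\gamma} w_i r_i^2(\beta) + s_i\left[\tfrac{1}{2} r_i(\beta) + \tfrac{1}{4}(1 - 2\tau)\gamma + s_i\left(r_i(\beta)(\tau - \tfrac{1}{2}) - \tfrac{1}{4}(1 - 2\tau + 2\tau^2)\gamma\right)\right],$$

yielding

$$D_{\gamma,\tau}(\beta) = \frac{1}{2\gamma} r' W_\gamma r + v'(s) r + c(s),$$

where $W_\gamma$ is the diagonal $n$ by $n$ matrix with diagonal elements $w_i(\beta)$,
$v'(s) = (s_1((2\tau - 1)s_1 + 1)/2, \ldots, s_n((2\tau - 1)s_n + 1)/2)$,
$c(s) = \sum_i [\tfrac{1}{4}(1 - 2\tau)\gamma s_i - \tfrac{1}{4} s_i^2 (1 - 2\tau + 2\tau^2)\gamma]$,
and $r(\beta) = (r_1(\beta), \ldots, r_n(\beta))'$.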
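
The gradient of $D_{\gamma,\tau}$ is given by

$$D^{(1)}_{\gamma,\tau}(\beta) = -X'\left[\frac{1}{\gamma} W_\gamma(\beta) r(\beta) + v(s)\right]$$

and for $\beta \in \mathbb{R}^p \setminus B_\gamma$ the Hessian exists and is given by

$$D^{(2)}_{\gamma,\tau}(\beta) = \frac{1}{\gamma} X' W_\gamma(\beta) X.$$

The gradient is a continuous function in $\mathbb{R}^p$, whereas the Hessian is piecewise
constant.

$s$ is called a $\gamma$-feasible sign vector if there exists $\beta \in \mathbb{R}^p \setminus B_\gamma$ with $s_\gamma(\beta) = s$. If
$s$ is $\gamma$-feasible then $Q_s$ is defined as the quadratic which is deduced from $D_{\gamma,\tau}(\beta)$
by inserting $s$ instead of $s_\gamma$. Thus, for any $\beta$ with $s_\gamma(\beta) = s$,

$$Q_s(\alpha) = \tfrac{1}{2}(\alpha - \beta)' D^{(2)}_{\gamma,\tau}(\beta)(\alpha - \beta) + D^{(1)}_{\gamma,\tau}(\beta)'(\alpha - \beta) + D_{\gamma,\tau}(\beta).$$

Clearly $D_{\gamma,\tau}(\alpha) = Q_s(\alpha)$ in the domain

$$C_s = \{\alpha \mid s_\gamma(\alpha) = s\}.$$

For each $\gamma > 0$ and $\beta \in \mathbb{R}^p$ there is one or several corresponding quadratics $Q_s$.
If $\beta \notin B_\gamma$ then $Q_s$ is characterized by $\beta$ and $\gamma$, but for $\beta \in B_\gamma$ the quadratic is
not unique. Therefore, a reference

$$(\gamma, \beta, s)$$

determines the quadratic. Define
$(\gamma, \beta, s)$ to be a feasible reference if $s$ is a $\gamma$-feasible sign vector with $\beta \in C_s$, and
$(\gamma, \beta, s)$ to be a solution reference if it is feasible and $\beta$ minimizes $D_{\gamma,\tau}$.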

The smoothing algorithm for minimizing $D_{\rho_\tau}$ is based on minimizing $D_{\gamma,\tau}$
for a set of decreasing $\gamma$. For every new value of $\gamma$, information from the previous
solution is utilized. Finally, when $\gamma$ is small enough, a solution can be found by
the modified Newton-Raphson algorithm. The algorithm is described as:

find an initial solution reference $(\gamma, \beta_\gamma, s)$
repeat
    decrease $\gamma$
    find a solution reference $(\gamma, \beta_\gamma, s)$
until $\gamma = 0$
$\beta_0$ is the solution.

The initial solution reference is found by letting $\beta_\gamma$ be the least squares solution
and then choosing $\gamma$ and $s$ appropriately. $\beta_0$ can be found after finitely many iterations
when $\gamma$ becomes small enough.

A strategy similar to that of Madsen and Nielsen (1993) can be used to decrease $\gamma$.
The computation involved is not significant compared with the Newton-Raphson
step.

For $L_1$ and quantile regression, it turns out that the smoothing algorithm
is very efficient and competitive, especially for fat data sets (e.g., $p/n > .05$).


5. Combining the Three Algorithms

Each of the three algorithms described above has its own advantages. None of
them dominates the others. Based on this fact we construct an adaptive algorithm
by combining these three algorithms. In the adaptive algorithm, we select one
of the three algorithms according to our experience from a large number of
simulations for various data sets.

If $n < 5000$ and $p < 50$, use simplex
else if $p/n > .05$, use smoothing
else use interior point
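
The selection rule above translates directly into code. The sketch below is a plain transcription of the three branches; the function name `select_algorithm` and the returned labels are hypothetical, and the thresholds are the empirical values stated in the text.

```python
def select_algorithm(n, p):
    """Pick a solver from the number of observations n and variables p.

    Thresholds (5000 observations, 50 variables, p/n = .05) are the
    empirical values quoted in the text.
    """
    if n < 5000 and p < 50:
        return "simplex"
    if p / n > 0.05:
        return "smoothing"
    return "interior point"

choices = {
    "small": select_algorithm(1000, 10),
    "fat": select_algorithm(10000, 1000),
    "tall": select_algorithm(100000, 100),
}
```

Note that the order of the tests matters: a small data set with $p/n > .05$ is still sent to the simplex branch because that condition is checked first.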



Figure 2. CPU time for $p/n = .05$ (horizontal axis: $p$; algorithms compared: IP and SMOOTH).



Comparisons between the simplex algorithm and the interior point algorithm
have been thoroughly done in Portnoy and Koenker (1997). Here we do some simple
comparisons between the interior point algorithm and the smoothing algorithm.
More details can be found in Chen (2002).

A data set is called fat if the ratio between the number of variables $p$ and
the number of observations $n$ exceeds a certain percentage. Here our empirical value
for this percentage is 5%. With this kind of data set, the smoothing algorithm
outperforms the interior point algorithm on average. The main reason is that the
smoothing algorithm finds the optimum solution in fewer iterations, which requires
fewer QR decompositions. This saves computing time, especially for
fat data sets.

Figure 3. CPU time for $p/n = .1$ (horizontal axis: $p$; algorithms compared: IP and SMOOTH).

To support this argument, we carried out two experiments with $L_1$ regression.
The first one has a fixed ratio $p/n = .05$ and the second one has $p/n = .1$. The sizes of
the data sets run from 500 to 5000 by 500. The variables are randomly generated
normal variables. The computing was carried out on a Dell PC with a 1.33GHz CPU
and 512MB RAM. The CPU time (in seconds) is displayed in the two plots.

For data sets with large $n$ and small $p$, such as $p/n < .001$, the smoothing algorithm
can be faster by using a fast threshold decreasing procedure. The difference in
computing time between the two algorithms is not as significant as for fat data sets.
For data sets with $.001 < p/n < .01$, our test cases show that these two algorithms
are competitive.

It should be pointed out that the interior point algorithm here does not include
the preprocessing described in Portnoy and Koenker (1997). The smoothing
algorithm can also implement the preprocessing. The effect of the preprocessing
on the smoothing algorithm is under investigation.


On Properties of Support Vector Machines for
Pattern Recognition in Finite Samples

A. Christmann 

Abstract. The support vector machine proposed by Vapnik belongs to a class
of modern statistical learning methods based on convex risk minimization.
Other special cases are AdaBoost, kernel logistic regression and least squares.
The support vector machine has the advantage that it usually leads to a
reduction of complexity, because only the support vectors and not all observations
contribute to the prediction of a new response. This paper addresses
robustness properties of the support vector machine for pattern recognition
in finite samples. Sensitivity curves in the sense of J.W. Tukey are used to
investigate the possible impact of a single data point.

Mathematics Subject Classification (2000). Primary 62G35, 62H30; Secondary 
62G05, 47N30. 

Keywords. Convex risk minimization, pattern recognition, reduction of com- 
plexity, robustness, sensitivity function, statistical learning methods, support 
vector machine. 



1. Introduction 

In statistical machine learning for pattern recognition two major goals are the estimation
of a functional relationship $y \approx f(x)$ between an outcome $y$ and predictors
$x = (x_1, \ldots, x_k)' \in \mathbb{R}^k$ and the prediction of an unobserved outcome $y_{new}$ based
on an observed value $x_{new}$ of predictors. The function $f$ is unknown. One needs
the implicit assumption that the relationship between $x_{new}$ and $y_{new}$ is, at least
almost, the same as in the training data set $(x_i, y_i)$, $i = 1, \ldots, n$. Otherwise,
it is useless to extract knowledge on $f$ from the training data set. The classical
assumption in machine learning is that the training data $(x_i, y_i)$ are independent
and identically generated from an underlying unknown distribution $P$ for random
variables $(X, Y)$. In practical applications the training data set is often quite large,
high dimensional and complex. The quality of the predictor $f(x)$ is measured by
some loss function $L(y, f(x))$. The goal is to find a predictor $f(x)$ which minimizes
the expected loss of $f$:

$$R_{L,P}(f, b) = E_P L(Y, f(X) + b). \qquad (1.1)$$

The financial support of the Deutsche Forschungsgemeinschaft (SFB 475, “Reduction of
complexity in multivariate data structures”) and of the Forschungsband DoMuS (“Modelling and
Simulation”) are gratefully acknowledged.

In this paper we are interested in binary classification, where $y \in \{-1, +1\}$. The
classification error is given by $I(y, f(x)) = I(y f(x) < 0) + I(f(x) = 0) I(y = -1)$,
where $I$ denotes the indicator function. Based on the law of large numbers one
might estimate $f$ by the minimizer of the empirical classification error

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i, f(x_i) + b). \qquad (1.2)$$

Unfortunately, the classification function $I$ is not convex and the minimization of
(1.2) is NP-hard (Höffgen et al., 1995). To circumvent this problem, one minimizes
a convex upper bound of the classification error function $I(y, f)$ (Schölkopf and
Smola, 2002; Vapnik, 1998). If $L : Y \times \mathbb{R} \to \mathbb{R}$ is an appropriate convex function, one
considers the (approximate) minimization of the empirical risk (Steinwart, 2002):

$$(f_{n,\lambda}, b_{n,\lambda}) = \arg\min_{f \in H, \, b \in \mathbb{R}} \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i) + b), \qquad (1.3)$$

where $\lambda$ is a small regularization parameter and $H$ is a reproducing kernel Hilbert
space (RKHS) of a kernel $k$. Often $\lambda$ is expressed via a positive penalizing constant
$C$ by $\lambda = (2Cn)^{-1}$. Let $\Phi : \mathcal{X} \to H$ be the feature map of $H$, i.e. $\Phi(x) := k(x, \cdot)$.
Hence, $f(x) = \Phi(x)'\theta + b$. Classical kernels $k$ are linear kernels and the
(exponential) radial basis function kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, $\gamma > 0$.
The prediction for the label $y$ at $x$ is $\mathrm{sign}(f(x) + b)$. This problem can be
interpreted as a stochastic approximation of the minimization of the theoretical
regularized risk given in (1.4) (Vapnik, 1998; Zhang, 2001; Steinwart, 2002):

$$(f_{P,\lambda}, b_{P,\lambda}) = \arg\min_{f \in H, \, b \in \mathbb{R}} \lambda \|f\|_H^2 + E_P L(Y, f(X) + b). \qquad (1.4)$$

The objective function in (1.4) is denoted by x 6* solution of 

(1.4) is unique for convex loss functions. This is one reason why convexity plays 
a prominent role in classical SVM theory. This is in contrast to robust statistics, 
where often non-convex loss functions are used, of course with two important 
exceptions: $L_1$ and Huber’s loss function. Suykens et al. (2002) investigate weighted
least squares support vector machines for regression problems in order to deal with 
non-convex loss functions as well. 

In this paper we will consider the support vector machine (SVM) which uses
the loss function $L(y, f(x) + b) = \max\{1 - y(f(x) + b), 0\}$. The support vector
machine has the inherent property that it leads to a reduction of complexity, as
usually not all observations but only the support vectors contribute to the predictions.
Other special cases are AdaBoost, kernel logistic regression, least squares
and modified least squares. Zhang (2001) shows for many convex loss functions that
the classifiers based on (1.3) are universally consistent if $\lambda_n \to 0$ and
$n\lambda_n \to \infty$, i.e. the classification error of $f_{n,\lambda}(\cdot)$ converges
to the optimal Bayes error in probability. Steinwart (2002) characterizes the loss
functions which lead to universally consistent classifiers and establishes universal
consistency for classifiers based on (1.3). Furthermore, he shows that there exist
solutions of the theoretical and of the empirical minimization problems. Steinwart
(2002) also gives lower asymptotic bounds on the number of support vectors, i.e.,
on the data points with non-vanishing coefficients, and investigates the asymptotic
behavior of $f_{n,\lambda}(\cdot)$ in terms of the loss function L. The RKHS part of the
solutions of (1.4) is unique. Schölkopf and Smola (2002) describe other support
vector machines and give an overview of algorithms to solve the minimization
problems corresponding to SVMs.
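The loss functions mentioned above can be written down in a few lines. The following is a minimal sketch, not code from the original study: the function names are illustrative, and the regularized empirical risk simply replaces the expectation in (1.4) by a sample average.

```python
import math

def hinge_loss(y, t):
    """SVM hinge loss max{1 - y*t, 0} for a label y in {-1, +1} and a score t = f(x) + b."""
    return max(1.0 - y * t, 0.0)

def logistic_loss(y, t):
    """Loss used by kernel logistic regression, log(1 + exp(-y*t))."""
    return math.log1p(math.exp(-y * t))

def least_squares_loss(y, t):
    """Least squares loss (1 - y*t)^2, which equals (y - t)^2 for y in {-1, +1}."""
    return (1.0 - y * t) ** 2

def regularized_empirical_risk(loss, ys, scores, lam, norm_f_sq):
    """lam * ||f||_H^2 plus the average loss over the sample (hypothetical helper)."""
    n = len(ys)
    return lam * norm_f_sq + sum(loss(y, t) for y, t in zip(ys, scores)) / n
```

Only observations with a margin y(f(x) + b) below 1 contribute to the hinge loss, which is the reason why only the support vectors enter the SVM solution.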

Obviously, the proof that many classifiers based on convex loss functions are 
universally consistent under weak conditions is a strong argument in favor of these 
statistical learning methods. Nevertheless, it is important to investigate robust- 
ness properties for such statistical learning methods for the following reasons. In 
practice one has to apply the methods to data sets with a finite sample size which 
contain a positive percentage of gross errors. Hampel et al. (1986, p. 27) note that 
1% to 10% of gross errors in routine data seem to be more the rule than the ex- 
ception. The data quality is sometimes far from being optimal, especially in large 
data mining problems (Hipp et al., 2001). From our point of view it is therefore
important to investigate the impact a small amount of contamination of the ‘true’ 
probability measure P can have on the support vector machine. 

The rest of the paper is organized as follows. Section 2 gives the definition of 
the sensitivity curve, which is the robustness concept we are dealing with. Section 3 
contains the results and Section 4 contains the conclusion. 



2. Sensitivity Curve 

The sensitivity curve proposed by J.W. Tukey measures the impact of just one
additional data point z on the empirical quantity of interest.

Definition 2.1. Sensitivity curve. The sensitivity curve of an estimator $T_n$ at a
point $z$ given a data set $z_1, \ldots, z_{n-1}$ is defined by

$$SC_n(z; T_n) = n \left( T_n(z_1, \ldots, z_{n-1}, z) - T_{n-1}(z_1, \ldots, z_{n-1}) \right). \qquad (2.1)$$

The Dirac distribution at the point $z$ is denoted by $\Delta_z$. If the estimator $T_n$ is
defined via $T(P_n)$, where $P_n$ denotes the empirical distribution of the data points
$z_1, \ldots, z_n$, then it holds for $\varepsilon_n = 1/n$:

$$SC_n(z; T_n) = \frac{T\left((1 - \varepsilon_n) P_{n-1} + \varepsilon_n \Delta_z\right) - T(P_{n-1})}{\varepsilon_n}. \qquad (2.2)$$
For many estimators, the sensitivity curve converges to the influence function
as n tends to infinity. Counterexamples are given, e.g., in Davies (1993).
The sensitivity curve can be interpreted as a finite sample version of the influence
function, see Hampel et al. (1986, p. 93).
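To make the definition concrete, the following sketch evaluates the sensitivity curve for two simple location estimators. The toy data set and the choice of the mean and the median are illustrative assumptions; they are not taken from the paper, which applies the same idea to the SVM.

```python
import statistics

def sensitivity_curve(estimator, data, z):
    """SC_n(z; T_n) = n * (T_n(z_1, ..., z_{n-1}, z) - T_{n-1}(z_1, ..., z_{n-1}))."""
    n = len(data) + 1
    return n * (estimator(data + [z]) - estimator(data))

# Toy data set with n - 1 = 5 points (illustrative values).
data = [-1.2, -0.5, 0.0, 0.3, 0.9]

for z in (1.0, 10.0, 100.0):
    sc_mean = sensitivity_curve(statistics.mean, data, z)
    sc_median = sensitivity_curve(statistics.median, data, z)
    print(z, sc_mean, sc_median)
```

For the mean, the sensitivity curve grows without bound as z moves away from the data, whereas for the median it stays constant once z lies beyond the central observations.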

Here, we investigate robustness properties of the SVM for finite samples.
In the next section, the sensitivity curve is used to study the impact of a
single data point on the results of the SVM. For an investigation of the influence 
function of certain statistical learning methods based on convex risk minimization 
methods, e.g. the support vector machine, kernel logistic regression, AdaBoost, 
and least squares, see Christmann and Steinwart (2003). 

3. Results 

We study the impact an additional data point can have, for the SVM for pattern
recognition, on

• the estimation of f,

• the fitted value of y,

• the empirical misclassification error,

• the estimated parameter θ, and

• the prediction areas of y.

First, we consider the well-known mixture data having n = 200 data points
given in Hastie et al. (2001), see Figure 1. The computations were done using
the software SVM$^{light}$ developed by Joachims (1999). We consider an exponential
radial basis function (RBF) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$. Appropriate
values for $\gamma$ and for the penalizing constant $C$ are important for the SVM and are
often determined by cross validation (Schölkopf and Smola, 2002). A cross validation
based on the leave-one-out error for the training data set was carried out by a
two-dimensional grid search on $\gamma \in \{0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 10, 20\}$ and
$C \in \{0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5, 10, 20\}$. As a result of the cross validation, the
tuning parameters for the SVM with RBF kernel were set to $\gamma = 3$ and $C = 1$.
For n = 200 this results in $\lambda = (2Cn)^{-1} = 0.0025$.
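The relation between the cost constant and the regularization parameter, and the size of the tuning grid, can be checked with a short sketch. The helper names are illustrative assumptions; the sketch does not call the SVM software used in the paper and does not perform the cross validation itself.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """Exponential RBF kernel exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

def regularization_from_cost(C, n):
    """Regularization parameter lambda = 1 / (2 * C * n)."""
    return 1.0 / (2.0 * C * n)

# Candidate values of the two-dimensional grid search described in the text.
gammas = [0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 10, 20]
costs = [0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5, 10, 20]
grid = [(g, c) for g in gammas for c in costs]  # 12 * 10 candidate pairs

print(len(grid), regularization_from_cost(C=1, n=200))
```

A leave-one-out cross validation over this grid requires refitting the SVM n times for each of the 120 candidate pairs.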

Figure 2 shows the sensitivity curve of f, if we add a single point z = (x, y)
to the original data set, where $x_1 = 2$, $x_2 = -2$, and y = +1. The additional
data point has only a local and smooth impact on f, if one uses the RBF kernel.
As was to be expected, the local impact on f near z is relatively large (but
finite) because $\lambda$ is very small. The outlying point would have an impact in a
broader region for higher values of $\lambda$. From Figure 3, where plots of f are given for
the original data set and for the modified data set, which contains the additional
data point z, we see that the RBF kernel yields f approximately equal to zero
outside a central region, as almost all data points are lying inside the central
region. Comparing the plot of f for the modified data set with the corresponding
plot for the original data set, it is obvious that the additional smooth peak is due
to the new data point located at x = (2, -2) with y = +1.

Figure 1. The mixture data set (Hastie et al, 2001). Left: data 
points. Right: fitted values. Legend: o for y = -1, • for y = +1. 




Figure 2. Left: Sensitivity function of f, if the additional data
point z is located at z = (x, y), where x = (2, -2) and y = +1.
Right: identical figure, but the z-axis has been truncated by 30
to increase the visibility.

Now, we study the impact of an additional data point z = (x, y), where
y = +1, on the percent of classification errors and on the fitted y-value for z.
We vary z over a grid in the x-coordinates. Figure 4 shows that the percent of
misclassification errors is approximately constant outside the central region that
contains almost all data points. The response of the additional data point was
correctly estimated by y = +1 outside the central region.

Next, we study the impact of an additional data point located at z = (x, y)
with y = +1 on the estimated parameters θ and b, see Figure 5. We vary z











Figure 3. Plot of f. Left: RBF kernel, original data set. Right:
RBF kernel, modified data set. The modified data set contains the
additional data point z = (x, y), where x = (2, -2) and y = +1.




Figure 4. Left: Percent of classification errors if one data point
z = (x, y = 1) is added to the original data set, where x varies
over the grid. Right: Fitted y-value for new observation if one
data point z = (x, y = 1) is added to the original data set, where
x varies over the grid.



over a grid in the x-coordinates in the same manner as before. The sensitivity
curves for the slopes estimated by the SVM with an RBF kernel are similar to an
affine hyperplane outside the central region, which contains almost all data points.
However, this does not mean that the predictions of outlying points are bad, see
Figure 4. In the central region, there is a smooth transition between regions with
higher sensitivity values and regions with lower sensitivity values.

Figure 5. Left: Sensitivity curve for the intercept b. Right: Sensitivity
curve for $\theta_2$. The sensitivity curve for $\theta_1$ is not shown here,
as it looks very similar.

For studying the effect a single data point can have on the prediction areas
of the response y depending on the penalizing constant C, which affects $\lambda$, we
generated a data set with n = 500 data points $x_i$ from a bivariate Student’s $t_5$
distribution with location parameter $\mu = (0, 0)$ and scatter matrix $\Sigma$. The diagonal
elements of $\Sigma$ were set to 1, whereas the off-diagonal elements were set to 0.25.
The responses $y_i$ were generated from a classical logistic regression model with
intercept for the parameter vector $\theta = (-1, 1)$ and $b = 1$, such that
$P(Y_i = +1) = [1 + \exp(-[b + x_i'\theta])]^{-1}$ and $P(Y_i = -1) = 1 - P(Y_i = +1)$.
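A data set of this kind can be generated with the standard library alone. The sketch below is an illustration under stated assumptions: the degrees of freedom are set to 5, the seed and the function names are arbitrary, and the original simulation was not necessarily implemented this way.

```python
import math
import random

def logistic(t):
    """Numerically stable 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def simulate(n=500, df=5, rho=0.25, theta=(-1.0, 1.0), b=1.0, seed=0):
    """Draw (x_i, y_i) with bivariate t regressors and logistic responses in {-1, +1}."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Correlated standard normals via the Cholesky factor of [[1, rho], [rho, 1]].
        u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        g1, g2 = u1, rho * u1 + math.sqrt(1.0 - rho ** 2) * u2
        # Dividing by sqrt(chi2_df / df) turns the normal vector into a t vector.
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        scale = math.sqrt(chi2 / df)
        x = (g1 / scale, g2 / scale)
        p_plus = logistic(b + x[0] * theta[0] + x[1] * theta[1])
        y = 1 if rng.random() < p_plus else -1
        data.append((x, y))
    return data

sample = simulate()
```

Adding the two moderate outliers of Figure 6 then amounts to appending the pairs ((8, -8), 1) and ((-8, 8), -1) to the generated list.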

The upper part of Figure 6 shows that the choice of the penalizing constant C 
can be quite important for making predictions based on a support vector machine 
with an RBF kernel. This holds true especially, if predictions are made for a 
response with an x-value outside the bulk of the x-values of the training data set.
In this sense, the SVM can produce unstable predictions and should be used with 
some care, if the value of C is not appropriate for the present data set. In the lower 
part of Figure 6, corresponding prediction areas are given if two of the data points 
are moderate outliers. The SVM with an RBF kernel is able to accommodate 
outliers due to the consistency property. 

Similar sensitivity analyses were done for some other situations: increased
sample size, situations where a complete separation of both response groups is 
possible, and for a multivariate normal distribution instead of a multivariate Stu- 
dent distribution to generate the x-values. The results are not given here, because 
the results were qualitatively similar to those described in this section. 
[Figure 6 panels: kernel=RBF, cost=1000 and kernel=RBF, cost=1; axis label x1.]

Figure 6. Prediction areas from a classical logistic regression
model. Upper: simulated data set. Lower: simulated data set with
two moderate outliers located at $x_A = (8, -8)$ with $y_A = 1$ and
$x_B = (-8, 8)$ with $y_B = -1$. Legend: • for y = 1, o for y = -1.
The prediction area for y = 1 is shaded.
4. Conclusion

In this paper, we used J.W. Tukey’s sensitivity curve to study the impact of a
data point on the support vector machine, which is a popular statistical learning
method based on convex risk minimization. Our results show that the use
of an exponential radial basis function kernel leads to a smooth sensitivity curve
for f which may have a peak around the additional data point if the response is
improbable. Varying the position of one additional data point had a smooth and
local impact on f and on the estimated parameters θ and b, if one uses an RBF
kernel. Of course, the height of the peak depends on the regularizing parameter
$\lambda$ and on the tuning parameter $\gamma$ used by the kernel. Sensitivity curves for f for
a support vector machine with a linear kernel were not shown here, but indicate
that the impact of varying one additional data point seems to be more global
than local.

An investigation concerning existence and uniform bounds with respect to 
P and z of the influence function of certain statistical learning methods based on
convex risk minimization methods is given in Christmann and Steinwart (2003). 
Support vector machine and kernel logistic regression are treated as special cases. 
Suykens et al. (2002) investigate robustness properties of weighted least squares 
support vector machines for regression problems. 

For a numerical comparison between the support vector machine with a linear 
kernel and the regression depth method recently proposed by Rousseeuw and Hu- 
bert (1999) see Christmann and Rousseeuw (2001) and Christmann et al. (2002). 
Smoothed Local L-Estimation
With an Application 

P. Cizek 



Abstract. The Nadaraya-Watson regression estimator is known to be highly
sensitive to the presence of outliers in the sample. A possible robustifica- 
tion consists in using local L-estimates of regression. Whereas the local L- 
estimation is traditionally done using the empirical conditional distribution 
function, Tamine et al. (2003) have recently proposed to use a smoothed 
conditional distribution function instead. This work studies computational 
aspects and small-sample properties of the smoothed L-estimation approach. 
The smoothed nonparametric L-estimator is applied to the estimation of the 
so-called implied volatilities, which describe the conditional variance of high- 
frequency financial time series (such as exchange rates or stock prices) inferred 
from the prices of related financial derivatives. 

Mathematics Subject Classification (2000). Primary 62G08; Secondary 62G35. 
Keywords. Nonparametric regression, L-estimation, smoothed cumulative dis- 
tribution function. 



1. Introduction 

One of the most widely used nonparametric regression estimators is the Nadaraya- 
Watson estimator by Nadaraya (1964) and Watson (1964). This estimator, being a 
local average of the response variable, is highly sensitive to the presence of outliers 
in the data: outliers do not only increase the variance of the estimator, but can 
also create fictitious peaks and structures in the estimated function. There have 
been many attempts to make the Nadaraya-Watson estimator and its extensions
more robust. Many of them rely on local M-estimation, see for example Härdle
and Tsybakov (1988), Fan and Jiang (1999), and Beran et al. (2001). On the other
hand, Boente and Fraiman (1994) proposed to use a local L-estimate such as
the local α-trimmed mean instead of a locally weighted average. Their procedure

The research was supported by grant GA CR number 402/03/0084. 
consists in using an empirical conditional distribution function to estimate the
conditional quantiles and to determine data to trim. 

The median smoothing and local $L_1$-estimation are special cases of local
L-estimation. Since they are easy to implement and their performance and
robustness in nonparametric regression are quite good (see Čížek and Härdle, 2003,
for instance), the use of local L-estimates may be a quite appealing alternative.
Unfortunately, the use of the empirical distribution function, which is a step func- 
tion by definition, increases the variance of the estimates, and additionally, the 
computational demands grow significantly with the sample size. To remedy this, 
Tamine et al. (2003) proposed to use a smoothed cumulative distribution function 
instead of the empirical one (see Fernholz, 1997, for theoretical arguments in the 
case of non-conditional L-estimates). 

In Section 2, after introducing some basic concepts and results, we propose an 
algorithm for the smoothed local L-estimation. Section 3 covers its finite-sample 
properties, whereas Section 4 presents an application to the estimation of implied 
volatility surfaces. 



2. Local L-Estimation 



Boente and Fraiman (1994) defined the conditional regression L-estimate of Y on
X at some $x \in \mathbb{R}^p$ by

$$m_L(x) = \int y\, J\{F_x(y)\}\, dF_x(y), \qquad (2.1)$$

where the L-score function $J$ is continuously differentiable a.e. on a compact
support $[a, b] \subset (0, 1)$ and $F_x(y)$ is the conditional distribution function of Y
conditional on X = x. This definition covers for example the α-trimmed mean and
median: $J(z) = (1 - 2\alpha)^{-1} I_{(\alpha, 1-\alpha)}(z)$, where $\alpha \in (0, 1/2)$ represents the amount
of trimming (the median corresponds to the limiting case $\alpha \to 1/2$).

Given a sample $(X_i, Y_i)_{i=1}^n$, Boente and Fraiman (1994) proposed to estimate
the unknown conditional distribution function $F_x(y)$ by the empirical conditional
distribution function

$$\hat F_x(y) = \sum_{i=1}^n \frac{K\{(x - X_i)/h_x\}}{\sum_{j=1}^n K\{(x - X_j)/h_x\}}\, I(Y_i \le y), \qquad (2.2)$$

where K is a p-dimensional kernel function, $h_x$ is a bandwidth, and I represents an
indicator function. Plugging it in the definition (2.1) of the conditional L-estimator,
one obtains the local empirical L-estimator

$$\hat m_L(x) = \int y\, J\{\hat F_x(y)\}\, d\hat F_x(y). \qquad (2.3)$$
Apart from $\hat F_x(y)$ being discontinuous, using $\hat m_L(x)$ is also computationally
demanding for large samples, since the number of operations needed to evaluate
$\hat m_L(x)$ at one point x grows as $O(n^2)$ (every element of the sum requires
estimating $\hat F_x(y)$).

Therefore, motivated by studies on smoothed conditional quantile estimators
by Mehra et al. (1991) and smoothed unconditional L-estimators by Fernholz
(1997), Tamine et al. (2003) proposed to estimate $F_x(y)$ by a smoothed conditional
distribution function

$$\tilde F_x(y) = \sum_{i=1}^n \frac{K\{(x - X_i)/h_x\}}{\sum_{j=1}^n K\{(x - X_j)/h_x\}}\, K_I\!\left(\frac{y - Y_i}{h_y}\right), \qquad (2.4)$$



where $K_I$ is a univariate cumulative distribution function (the integral of a kernel)
and $h_y$ is another bandwidth, and they proved its asymptotic normality under
mild β-mixing conditions. This estimator inherits the regularity properties of the
univariate kernel $K_I$ and is typically differentiable. Moreover, the function $m_L(x)$
can then be estimated by the local smoothed L-estimator

$$\tilde m_L(x) = \int y\, J\{\tilde F_x(y)\}\, d\tilde F_x(y), \qquad (2.5)$$

whereby the integral is approximated using classical numerical integration
routines. Given a conditional distribution function, numerical integration with a
prescribed precision typically needs a grid with a fixed number k of points,
and hence, $\tilde F_x(y)$ has to be evaluated only at a fixed and finite number of points.
Tamine et al. (2003) presented a detailed analysis of computational demands for
both the empirical ($O(n^2)$ operations) and smoothed local L-estimation ($O(kn)$
operations). Consequently, the smoothed local L-estimation is faster to compute and
should exhibit better small-sample properties because of the continuity of $\tilde F_x(y)$.
On the other hand, the asymptotic distributions of the empirical and smoothed
L-estimates are equivalent (Tamine et al., 2003).
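The smoothed local L-estimator can be sketched for a univariate regressor as follows. This is a minimal illustration, assuming Gaussian kernels for both the weights and the cdf smoothing, an α-trimming score function, and a simple midpoint rule on a user-supplied grid; the function names are hypothetical and the code does not reproduce the implementation of Tamine et al. (2003) or the one used in this paper.

```python
import math

def gauss_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def gauss_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def weights(x, xs, hx):
    """Kernel weights K((x - X_i)/h_x) normalized to sum to one."""
    k = [gauss_pdf((x - xi) / hx) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]

def smoothed_cdf(y, ys, w, hy):
    """Smoothed conditional cdf: weighted sum of K_I((y - Y_i)/h_y)."""
    return sum(wi * gauss_cdf((y - yi) / hy) for wi, yi in zip(w, ys))

def trimmed_score(u, alpha):
    """L-score function of the alpha-trimmed mean."""
    return 1.0 / (1.0 - 2.0 * alpha) if alpha < u < 1.0 - alpha else 0.0

def smoothed_local_l_estimate(x, xs, ys, hx, hy, alpha, grid):
    """Midpoint-rule approximation of the integral of y * J(F(y)) dF(y) over the grid."""
    w = weights(x, xs, hx)
    F = [smoothed_cdf(y, ys, w, hy) for y in grid]  # k evaluations, each O(n)
    total = 0.0
    for k in range(len(grid) - 1):
        y_mid = 0.5 * (grid[k] + grid[k + 1])
        F_mid = 0.5 * (F[k] + F[k + 1])
        total += y_mid * trimmed_score(F_mid, alpha) * (F[k + 1] - F[k])
    return total
```

The cdf is evaluated once per grid point, so the cost at one point x is proportional to the number of grid points times the sample size, in line with the O(kn) operation count quoted above.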

In the following Section 3, we will compare in more detail the finite-sample
performance of the empirical and smoothed local L-estimates and other existing
nonparametric smoothers. Before doing so, note that the integration procedure
used here differs from the one used by Tamine et al. (2003), who used an equidistant
integration grid spanning the range from $\min_i Y_i - |\min_i Y_i|$ to
$\max_i Y_i + |\max_i Y_i|$. Since this requires a relatively large number of grid
points to achieve a good approximation of (2.5), we propose to use such an
equidistant grid with a step c only between $i_x = \min_{|X_i - x| < h_x} Y_i$ and
$s_x = \max_{|X_i - x| < h_x} Y_i$, that is, only between the conditional minima
and maxima of $Y_i$. Outside of this range, the grid points are defined by
$s_x + 1/(kc)$ and $i_x - 1/(kc)$ for $k = 1, \ldots, \lfloor 1/c^2 \rfloor$. Thus, the whole
real line is covered and the grid is most dense in the regions where $F_x(y)$ changes most.
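The grid construction described above can be sketched as follows. The function name, the handling of a range that is not a multiple of c, and the upper limit of the index k (taken here as the integer part of 1/c^2) are assumptions made for illustration.

```python
import math

def integration_grid(x, xs, ys, hx, c):
    """Grid that is equidistant between the local minimum and maximum of Y
    and thins out as 1/(k*c) beyond them."""
    local = [yi for xi, yi in zip(xs, ys) if abs(xi - x) < hx]
    if not local:
        raise ValueError("no observations within the bandwidth hx of x")
    lo, hi = min(local), max(local)
    # Equidistant inner part; the step is at most c.
    steps = max(1, math.ceil((hi - lo) / c))
    inner = [lo + (hi - lo) * j / steps for j in range(steps + 1)]
    # Outer parts: offsets 1/(k*c) shrink from 1/c down to about c.
    k_max = math.floor(1.0 / c ** 2)
    upper = [hi + 1.0 / (k * c) for k in range(1, k_max + 1)]
    lower = [lo - 1.0 / (k * c) for k in range(1, k_max + 1)]
    return sorted(set(lower + inner + upper))

grid = integration_grid(0.5, [0.4, 0.5, 0.6, 0.9], [1.0, 2.0, 3.0, 100.0], hx=0.15, c=0.5)
```

In this toy call, the observation at x = 0.9 lies outside the bandwidth, so its extreme response does not stretch the densely covered part of the grid.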
3. Finite-Sample Comparison

In this section, we attempt to compare the finite sample properties of the
empirical and smoothed local L-estimators with an α-trimming L-score function using
Monte Carlo simulations. Relying on arguments established by Fernholz (1997)
in a non-conditional setting, the smoothed local L-estimator should have better
finite-sample properties than the empirical one. This is probably surprising since
an additional smoothing is used to estimate $F_x(y)$, which may cause an additional
bias. Nevertheless, it is accompanied by a decrease in the variance of the estimator
so that the mean square error of smoothed estimates actually decreases compared
to the empirical ones. Besides the local L-estimators with trimming varying from
10% to 50%, we include for comparison the classical Nadaraya-Watson estimator.

Table 1. Comparison of the Nadaraya-Watson, the empirical (L-empir)
and smoothed (L-smooth) local L-estimators by the mean
square error under normal errors. All mean square errors are
multiplied by $10^2$.

Grid    Nadaraya-Watson    L-empir    L-smooth        L-smooth
                                      (h_y = 0.15)    (h_y = 0.25)
0.05    2.52               2.15       2.65            2.61
0.10    0.73               1.00       0.77            0.74
0.15    0.68               0.94       0.66            0.66
0.20    0.58               0.82       0.56            0.56
0.25    0.51               0.71       0.50            0.50
0.30    0.48               0.69       0.48            0.48
0.35    0.45               0.64       0.46            0.45
0.40    0.38               0.55       0.38            0.38
0.45    0.35               0.52       0.35            0.35
0.50    0.38               0.55       0.38            0.37
0.55    0.42               0.62       0.42            0.41
0.60    0.46               0.69       0.46            0.46
0.65    0.50               0.76       0.52            0.50
0.70    0.50               0.75       0.50            0.49
0.75    0.46               0.71       0.49            0.47
0.80    0.46               0.75       0.49            0.47
0.85    0.51               0.86       0.58            0.54
0.90    0.83               0.96       0.89            0.86
0.95    2.59               2.01       2.60            2.63




Simulations were performed for a univariate data generating process Y = 
m(x)-j-e, where the regression function is given by rriL{x) = -l-\-y/x-x‘^ and the 
regressor x is uniformly distributed on interval (0, 1). The error term ε comes from
Gaussian, heavy-tailed, and contaminated distributions. The results presented here
are based on 500 repeated simulations with samples of 100 observations, although
we also examined smaller and larger sample sizes, getting the same qualitative
results. All simulations were performed using XploRe.

3.1. Clean Data 

In this subsection, we concentrate on the behavior of estimators in models where
the error ε comes from the same distribution for all observations. The performance
of all estimators is compared for the following error distributions: normal $N(0, 0.1)$,
Student $t$ and exponential $\mathrm{Exp}(\sigma_e)$, where $\sigma_t = 0.57$ and $\sigma_e = 0.71$ are chosen
so that the variance of errors is always equal to 0.1. For the sake of brevity, we
report results only for sample size n = 100, Gaussian kernel and for bandwidth
$h_x = 0.075$, which is the MISE-minimizing bandwidth for the Nadaraya-Watson
estimator on the interval (0.07, 0.93) used for graphical comparison of estimators
by MSE (see Figures 1-3). The simulations using larger sample sizes and other
bandwidth choices produced qualitatively similar results. The methods for
choosing $h_x$ and $h_y$ in practice are discussed in Subsection 3.3.



[Figure 1 panels: mean square errors plotted against x. Left: ε ~ N, α = 0.25,
c = 2σ. Right: ε ~ N, α = 0.5, c = σ.]

Figure 1. Mean square errors of the Nadaraya-Watson (dotted
line), empirical (thick solid line) and smoothed (solid lines with
crosses for $h_y = 0.05$, circles for $h_y = 0.10$, triangles for $h_y = 0.15$,
and squares for $h_y = 0.20$) local L-estimators. The trimming level
of L-estimator is denoted α.

A typical set of results is presented in Table 1, where simulations for
$\varepsilon \sim N(0, 0.1)$ are summarized. There are several interesting points to observe even in
this simple setting. First, notice that the smoothed L-estimator performs better
than the empirical one at all but the boundary points. This is very typical indeed:
in the inner points of the interval, the bias introduced by additional smoothing
of a distribution function is much smaller than the decrease in variance caused
by the same smoothing and thus the smoothed L-estimator performs better; at
the boundaries, where fewer observations are available, it is vice versa. Second, the
smoothed L-estimator performs almost as well as the Nadaraya-Watson estimator.
The difference between these two estimators is much smaller than in the same
simulations done by Tamine et al. (2003), which can be attributed to the improved
integration procedure introduced in Section 2.

To enable a more extensive comparison, the rest of the results are compared 
graphically using plots of the mean square errors of various estimators at the grid 
used in Table 1. For every simulation, there is a pair of graphs — one containing 
the empirical and smoothed L-estimates with 25% trimming, the other containing 
the L-estimates with 50% trimming; both cases always include the Nadaraya- 
Watson estimates for reference and in both cases, the smoothing bandwidth hy 
ranges from 0.05 to 0.20. Figure 1 shows once again the mean square errors of all 
estimators under normally distributed errors. Notice that, even for 50% trimming, 
the smoothed L-estimator is still as good as the Nadaraya-Watson estimator, at
least for hy > 0.10 (for values hy close to 0, the smoothed L-estimator converges 
to the empirical one). 

Figure 2 shows the same simulations under the Student and exponential 
distributions. Under the Student distribution, the performance of the Nadaraya- 
Watson estimator is clearly the worst one. Further, the smoothed local L-estimator 
is for all choices of the bandwidth hy slightly better than its empirical counterpart. 
Under the exponential distribution, the Nadaraya-Watson estimator is inferior to
all smoothed L-estimators. The smoothed local L-estimator outperforms the em- 
pirical one even in this case, although just for bandwidths hy < 0.10 on the whole 
interval (0.1, 0.9). This together with the fact that, contrary to the Gaussian case, 
the smallest bandwidth hy is the best choice in the case of the double exponen- 
tial distribution indicates the importance of a bandwidth-choice procedure (see 
Subsection 3.3). 

3.2. Contaminated Data 

To study robustness properties of local L-estimators, we used the same model as in
Subsection 3.1 and added 3 observations at x = 0.25, 0.50, 0.75 with y-values coming
from the uniform distribution on (-1.5, 1.5). The simulation results for all the
estimators discussed in this paper are summarized in Figure 3. The difference from the
previous simulations in Subsection 3.1 is that a small level of trimming α = 0.1 is used
instead of α = 0.25 to highlight the behavior under small and large levels of
trimming. First of all, one can notice the sharp peaks in the mean square errors of the
Nadaraya-Watson estimator around the outliers (they actually exceed the range
used on the MSE axis). Second, the smoothed L-estimator is apparently better than
its empirical version, especially for 50% trimming. The difference between the MSE
of the smoothed and empirical L-estimators in the case of contaminated data sets
seems to be roughly the same as in the case of clean Gaussian data, see Figure 1.
[Figure 2 panels: mean square errors plotted against x. Surviving panel titles:
ε ~ Exp, α = 0.25, c = 2σ and ε ~ Exp, α = 0.5, c = σ.]

Figure 2. Mean square errors of the Nadaraya-Watson (dotted
line), empirical (thick solid line) and smoothed (solid lines with
crosses for $h_y = 0.05$, circles for $h_y = 0.10$, triangles for $h_y = 0.15$,
and squares for $h_y = 0.20$) local L-estimators. The trimming level
of L-estimator is denoted α.

[Figure 3 panels: mean square errors plotted against x. Left: 3 outliers; ε ~ N,
α = 0.1, c = 2σ. Right: 3 outliers; ε ~ N, α = 0.5, c = σ.]

Figure 3. Mean square errors of the Nadaraya-Watson (dotted
line), empirical (thick solid line) and smoothed (solid lines with
crosses for $h_y = 0.05$, circles for $h_y = 0.10$, triangles for $h_y = 0.15$,
and squares for $h_y = 0.20$) local L-estimators. The trimming level
of L-estimator is denoted α.

On the other hand, it is interesting to see how the smoothed L-estimator
may react to outliers, although only to a limited extent, for some choices of
bandwidth $h_y$. Especially for α = 0.10 trimming, the effect of outliers on the
regression estimates increases with the cdf smoothing bandwidth $h_y$. This is caused
by the smoothing term $K_I[(y - Y_i)/h_y]$ in equation (2.4): (i) an outlier $Y_j > y$
has a small positive impact on any value of the smoothed cdf $\tilde F_x(y)$ as long as
$K_I[(y - Y_j)/h_y] > 0$ (this is always true for the Gaussian kernel used here); (ii)
this effect increases with $h_y$ because $K_I$ is an increasing function (it is the integral
of a kernel).
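The dependence on the bandwidth described above can be illustrated numerically. The sketch assumes a Gaussian cdf for the smoothing function and uses arbitrary illustrative values for the evaluation point and the outlier; the helper name is hypothetical.

```python
import math

def gauss_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def outlier_contribution(y, y_outlier, hy, weight=1.0):
    """Contribution weight * K_I((y - y_outlier) / hy) of one observation
    to the smoothed cdf evaluated at y."""
    return weight * gauss_cdf((y - y_outlier) / hy)

# An outlier located above the evaluation point y (illustrative values).
contributions = [outlier_contribution(y=0.0, y_outlier=0.3, hy=hy)
                 for hy in (0.05, 0.10, 0.15, 0.20)]
print(contributions)
```

The contribution is positive for every bandwidth and grows as the bandwidth increases, which is the mechanism behind the larger effect of outliers for larger values of the cdf smoothing bandwidth.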



3.3. Bandwidth Choice 

As gradually became clear in Subsections 3.1 and 3.2, the use of the smoothed cdf
improves the performance of the local L-estimator, but the extent of the improvement
depends significantly on the choice of the bandwidth $h_y$ (the other bandwidth $h_x$
is common to all nonparametric regression estimators). To determine the optimal
bandwidth pair $(h_x, h_y)$, one should use a robust bandwidth-selection procedure
because the standard methods for the bandwidth selection, such as least-squares
cross validation, can exhibit downward bias in the presence of outliers. Although we
did not explore the bandwidth selection to a greater extent, established robust
bandwidth-selection methods such as, for example, the $L_1$ cross validation by
Wang and Scott (1994) and robust plug-in methods by Boente et al. (1997) can
generally be recommended.
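As a rough illustration of a cross validation criterion based on absolute rather than squared errors, the following sketch selects a bandwidth for a Nadaraya-Watson smoother by leave-one-out absolute prediction error. It is a generic sketch under stated assumptions (Gaussian kernel, univariate regressor, hypothetical function names); it is not claimed to reproduce the procedure of Wang and Scott (1994), and it is shown for the Nadaraya-Watson smoother only for brevity.

```python
import math

def nw_estimate(x, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x with a Gaussian kernel, optionally leaving out one index."""
    num = den = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        k = math.exp(-0.5 * ((x - xi) / h) ** 2)
        num += k * yi
        den += k
    return num / den

def loo_absolute_error(xs, ys, h):
    """Mean leave-one-out absolute prediction error for bandwidth h."""
    n = len(xs)
    return sum(abs(ys[i] - nw_estimate(xs[i], xs, ys, h, skip=i)) for i in range(n)) / n

def select_bandwidth(xs, ys, candidates):
    """Candidate bandwidth with the smallest leave-one-out absolute error."""
    return min(candidates, key=lambda h: loo_absolute_error(xs, ys, h))
```

Because a single outlier enters such a criterion linearly rather than quadratically, it has less leverage on the selected bandwidth than under least-squares cross validation.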
4. Application: Implied Volatility Surface

The analysis of volatility, that is, of the conditional variance of financial time series 
(such as exchange rates or stock prices), is nowadays one of the most important 
issues in modern finance. Moreover, instead of volatilities estimated from historical 
data, the so-called implied volatility approach is gaining importance. The implied 
volatility of a stock or an index is derived from the currently observed prices of 
related financial instruments (options) on the market and thus reflects the cur- 
rent market expectations. Having data on a sufficient number of transactions, one 
can estimate a mapping that gives implied volatilities for various combinations of 
maturities and prices, which is called the implied volatility surface (see Härdle et al.,
2002, Chapter 6, for more details). The series of these surfaces are then usually 
studied in time to reveal the dynamics of implied volatilities and to predict future 
development. Thus, estimating these surfaces is of great importance. 



Figure 4. Implied volatility surface for DAX (German stock index)
on January 4, 1999.

The implied volatility surface on a specified grid is typically estimated by
a two-dimensional nonparametric smoothing procedure; mostly, the Nadaraya-Watson
estimator with a quartic kernel is employed, see Aït-Sahalia and Lo (1998,
2000) or Fengler et al. (2002), for instance. An example of an estimated surface is
shown in Figure 4. On the other hand, the data on implied volatilities are often scarce
or of low quality for some combinations of maturities and strike prices. Since errors
in estimation are transferred to any subsequent analysis, it would be wise to use
a more robust local smoothing procedure than the Nadaraya-Watson estimator.
Given the simulation results and practical aspects, the smoothed local L-estimator 
seems to be a very good candidate. It can be applied to this problem, since both in 
the nonparametric regression and the implied volatility estimation one estimates 
an expectation of a random variable conditional on values of exogenous variables. 



[Figure 5 panels: volatility surfaces for April 16, 1999, and February 26, 1999.]

Figure 5. Implied volatility surfaces estimated by the Nadaraya-Watson
(solid line) and smoothed local L-estimator (dotted line).

To demonstrate the impact of using a robust smoothing procedure instead
of the standard Nadaraya-Watson smoother, we estimated the implied volatility
surfaces for DAX (a German stock index) on two selected days. The options on
DAX are the most actively traded contracts at the derivative exchange EUREX and
should therefore provide observations on different maturities and strike prices in
sufficient quantity and quality. Moreover, data are in practice cleaned of unusual
contracts and big errors. Nevertheless, as we demonstrate, the use of the simple
Nadaraya-Watson procedure may still introduce large errors that are then inherited
by any subsequent analysis.

Let us compare the implied-volatility surface estimates on two days: April 16,
1999, and February 26, 1999. The estimates produced by the Nadaraya-Watson
estimator and the smoothed local L-estimator for maturities ranging from
10 days to 3 months are shown in Figure 5 (25% trimming is used; the estimated
surfaces are almost the same for trimming levels from 10% to 40%). In the
first case, the estimated surfaces are almost identical, and this is probably
the most usual case. In the second case, there is quite a big difference
between the estimated surfaces. The most likely reason is that there is a
large difference in volatility between shorter maturities (one to three
months) and longer ones: since the discrete nature of observed maturities
often enforces oversmoothing, the different volatilities at longer maturities
can adversely affect the Nadaraya-Watson estimates. This adverse effect is
much less pronounced for the smoothed local L-estimator, as can be verified by
looking at the average mean squared error for observations within the surface
area. One can also notice that the volatility surfaces estimated by the
smoothed L-estimator have roughly the same magnitude on both days, while the
magnitudes of the Nadaraya-Watson estimates differ greatly. Hence, a more
robust procedure can “survive” strange behavior on the market and can
actually detect it by comparison with standard nonparametric estimators.
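
The contrast between the two smoothers can be illustrated with a small
one-dimensional sketch. It is not the smoothed local L-estimator of Tamine et
al. (2003): as a stand-in it uses a plain local trimmed mean (an unsmoothed
local L-estimator), and the function names, the Gaussian kernel, the window
rule and the toy data are all assumptions made for illustration only.

```python
import math

def nadaraya_watson(x0, xs, ys, h):
    """Kernel-weighted mean of ys around x0 (Gaussian kernel, bandwidth h)."""
    w = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def local_trimmed_mean(x0, xs, ys, h, trim=0.25):
    """Trimmed mean of the responses whose x lies within h of x0.

    A fraction `trim` of the local responses is dropped from each tail
    (an assumption; the paper only states the trimming level).
    """
    local = sorted(y for x, y in zip(xs, ys) if abs(x - x0) <= h)
    k = int(trim * len(local))            # points dropped from each tail
    kept = local[k:len(local) - k]
    return sum(kept) / len(kept)

# Toy data: a flat curve at 1.0 with two grossly wrong observations.
xs = [i / 10 for i in range(21)]
ys = [1.0] * 21
ys[9] = ys[12] = 10.0

nw = nadaraya_watson(1.0, xs, ys, h=0.5)
tm = local_trimmed_mean(1.0, xs, ys, h=0.5)
print(f"Nadaraya-Watson: {nw:.3f}, local trimmed mean: {tm:.3f}")
```

In this toy example the kernel average is pulled well above the flat level by
the two aberrant points, while the trimmed local estimate stays at it, which
is the qualitative behavior described for the February 26 surface.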

Concluding Remarks 

Following the Tamine et al. (2003) proposal of the smoothed local L-estimator, 
we improved the computation procedure and studied its small sample behavior. 
The smoothed local L-estimation exhibits better finite-sample properties than the 
empirical one and is faster to compute. As demonstrated, the use of such a robust 
technique may be highly desirable, for example, in modeling volatilities of financial 
instruments.


Fast Algorithms for Computing
High Breakdown Covariance Matrices
with Missing Data

S. Copt and M.-P. Victoria-Feser 



Abstract. Robust estimation of covariance matrices when some of the data 
at hand are missing is an important problem. It has been studied by Little 
and Smith (1987) and more recently by Cheng and Victoria-Feser (2002). The 
latter propose the use of high breakdown estimators and so-called hybrid al- 
gorithms (see, e.g., Woodruff and Rocke, 1994). In particular, the minimum 
volume ellipsoid of Rousseeuw (1984) is adapted to the case of missing data. 
To compute it, they use (a modified version of) the forward search algorithm 
(see e.g. Atkinson, 1994). In this paper, we propose to use instead a modifica- 
tion of the C-step algorithm proposed by Rousseeuw and Van Driessen (1999) 
which is actually a lot faster. We also adapt the orthogonalized Gnanadesikan- 
Kettenring (OGK) estimator proposed by Maronna and Zamar (2002) to the 
case of missing data and use it as a starting point for an adapted S-estimator.
Moreover, we conduct a simulation study to compare different robust estima- 
tors in terms of their efficiency and breakdown. 

Mathematics Subject Classification (2000). Primary 62G35; Secondary 62H12. 
Keywords. C-step algorithm, minimum covariance determinant, outliers,
robust statistics, S-estimators, orthogonalized Gnanadesikan-Kettenring robust
estimator.



1. Introduction 

The statistical literature contains several proposals for high breakdown
estimators of the mean and covariance in multivariate data when it is
suspected that the data contain outliers or extreme observations. A well-known
one is the minimum covariance determinant (MCD) of Rousseeuw (1984). However,
little has been done for the case of missing data, which is very common in
practice. Only Little and Smith (1987) and Cheng and Victoria-Feser (2002)
propose solutions. In this paper we concentrate on robust estimators with
missing data. In particular, we propose faster algorithms for their
computation and compare the estimators through extensive simulations, both in
terms of their robustness properties when the data are contaminated and in
terms of the speed of two different algorithms used to compute them. We also
adapt the orthogonalized Gnanadesikan-Kettenring (OGK) estimator proposed by
Maronna and Zamar (2002) to the case of missing data and use it as a starting
point for an adapted S-estimator. All our programs are available (upon
request) in the form of an Splus library, which has been used to produce the
results and graphics presented in this paper.



2. A General Class of Estimators with Missing Data 



The aim is to estimate the parameters $\mu$ and $\Sigma$, i.e., the mean and
covariance of an underlying multivariate variable $Y = (Y_1, \ldots, Y_p)$
that has supposedly generated the sample $y_i$, $i = 1, \ldots, n$, at hand.
As often happens in practice, we suppose that some of the observations might
be missing, in that the $y_{ij}$ are observed for some $j \in \{1, \ldots, p\}$
and missing for the other $j$'s. In other terms, $y_i = (y_{[oi]}, y_{[mi]})$,
so that a distinction is made between the observed ($oi$) and the missing
($mi$) data. We suppose that the data are missing at random (see Rubin, 1976),
a sufficient condition for correct likelihood-based inferences. Most known
estimators of mean and covariance with missing data fall in the class proposed
by Cheng and Victoria-Feser (2002), i.e.,



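$$\frac{1}{n}\sum_{i=1}^{n} w_i^{\mu}\,(\hat{y}_i - \mu) = 0 \tag{2.1}$$

$$\frac{1}{n}\sum_{i=1}^{n} \left[ w_i^{\eta}\,(\hat{y}_i - \mu)(\hat{y}_i - \mu)^T - w_i^{\delta}\,\Sigma + C_i \right] = 0 \tag{2.2}$$

where

$$\hat{y}_i = \begin{pmatrix} y_{[oi]} \\ E[y_{[mi]} \mid y_{[oi]}, \mu, \Sigma] \end{pmatrix} = \begin{pmatrix} y_{[oi]} \\ \mu_{[mi]} + \Sigma_{[mioi]} \Sigma_{[oioi]}^{-1} (y_{[oi]} - \mu_{[oi]}) \end{pmatrix} \tag{2.3}$$

and

$$C_i = \begin{pmatrix} 0 & 0 \\ 0 & \mathrm{cov}(y_{[mi]}, y_{[mi]} \mid y_{[oi]}, \mu, \Sigma) \end{pmatrix} \tag{2.4}$$

where, for example, $\Sigma_{[oioi]}$ denotes the partition of $\Sigma$
corresponding to the observed part of $y_i$, etc. The different estimators are
actually defined through the data weighting system given by $w_i^{\mu}$,
$w_i^{\eta}$ and $w_i^{\delta}$ in (2.1) and (2.2), which in turn also depend
on the parameters $\mu$ and $\Sigma$ (see below). To compute the estimators,
one can use an iterative procedure in which, given current estimates
$\mu^{(t)}$ and $\Sigma^{(t)}$, the $\hat{y}_i$, $C_i$ and the weights are
first computed, and the values are then updated by

$$\mu^{(t+1)} = \sum_{i=1}^{n} w_i^{\mu}\,\hat{y}_i \Big/ \sum_{i=1}^{n} w_i^{\mu} \tag{2.5}$$

$$\Sigma^{(t+1)} = \sum_{i=1}^{n} \left[ w_i^{\eta}\,(\hat{y}_i - \mu^{(t+1)})(\hat{y}_i - \mu^{(t+1)})^T + C_i \right] \Big/ \sum_{i=1}^{n} w_i^{\delta} \tag{2.6}$$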



We have not worked out conditions under which (2.1) and (2.2) have a unique
solution, nor conditions under which (2.5) and (2.6) converge to that unique
solution. However, the reader is referred to Davies (1987) for general
conditions for S-estimators.
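
As a point of reference for the iteration (2.5) and (2.6), the following is a
minimal sketch of its classical special case, in which every observation
carries unit weight and the iteration is the EM algorithm for a multivariate
normal sample. Missing entries are filled by their conditional means and the
conditional covariance of the missing part is added as a correction matrix.
The function name `em_mean_cov`, the coding of missing values as NaN, the
starting values and the stopping rule are assumptions of this sketch, not
details taken from the paper or from the authors' Splus library.

```python
import numpy as np

def em_mean_cov(Y, n_iter=200, tol=1e-8):
    """EM estimates of mean and covariance; missing entries of Y are NaN."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)
    mu = np.nanmean(Y, axis=0)                 # assumed starting values
    sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        Y_hat = Y.copy()
        C_sum = np.zeros((p, p))               # sum of correction matrices
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                    # nothing observed in this row
                Y_hat[i] = mu
                C_sum += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            B = np.linalg.solve(S_oo, S_mo.T).T        # S_mo S_oo^{-1}
            Y_hat[i, m] = mu[m] + B @ (Y[i, o] - mu[o])  # conditional mean
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ S_mo.T
        mu_new = Y_hat.mean(axis=0)
        R = Y_hat - mu_new
        sigma_new = (R.T @ R + C_sum) / n
        change = max(np.abs(mu_new - mu).max(),
                     np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma
```

With complete data the function returns the sample mean and the
maximum-likelihood (divisor $n$) covariance matrix. A robust variant would
replace the unit weights by data-dependent ones in the two update lines.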

The classical MLE is obtained when $w_i^{\mu} = w_i^{\eta} = w_i^{\delta} = 1$
$\forall i$, and (2.5) and (2.6) then define the EM algorithm (Dempster et
al., 1977). However, with complete data it is well known that the MLE of mean
and covariance is not robust. When there are missing data, the situation does
not change; see Cheng and Victoria-Feser (2002). Little and Rubin (1987)
propose to base the M-step on a robust estimator belonging to the general
class of M-estimators (Huber, 1981). They call the resulting procedure the ER
algorithm. Their estimator is defined by (2.5) and (2.6) with






W, = Wj 






(2.7) 



where

$$d_{oi}^2 = d_{oi}^2(\mu, \Sigma) = (y_{[oi]} - \mu_{[oi]})^T \Sigma_{[oioi]}^{-1} (y_{[oi]} - \mu_{[oi]}) \tag{2.8}$$

is the squared Mahalanobis distance corresponding to the observed part of
$y_i$. See Little and Smith (1987) for the choice of the weight function $u$.
The iteration step for the covariance matrix (2.6) does not exactly correspond
to the same step in the ER algorithm, in that the weights $w_i^{\eta}$ are not
applied to the correction matrix $C_i$. In what follows we will nevertheless
consider this slight modification of the ER algorithm (Cheng and
Victoria-Feser, 2002). If case $i$ is uncontaminated, the data are normal and
the missing values are missing at random, then (2.8) is asymptotically
$\chi^2_{p_i}$, where $p_i = \dim(y_{[oi]})$. The Wilson-Hilferty
transformation of the chi-squared distribution yields
$(d_{oi}^2/p_i)^{1/3} \sim N(1 - 2/(9p_i),\, 2/(9p_i))$. Following Little and
Smith (1987), we also propose a probability plot of

$$\frac{(d_{oi}^2/p_i)^{1/3} - 1 + 2/(9p_i)}{\sqrt{2/(9p_i)}} \tag{2.9}$$

versus standard normal order statistics, which should reveal atypical
observations.
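
The standardization in (2.9) is a one-line computation once the squared
distance and the number of observed coordinates are known. The sketch below
assumes the squared Mahalanobis distance has already been computed on the
observed part of a case; the function name is hypothetical.

```python
import math

def wilson_hilferty_z(d2, p_obs):
    """Approximate N(0, 1) score of a chi-squared(p_obs) value d2.

    Uses (d2 / p)^(1/3) ~ N(1 - 2/(9p), 2/(9p)), so cases with different
    numbers of observed coordinates become comparable.
    """
    cube_root = (d2 / p_obs) ** (1.0 / 3.0)
    mean = 1.0 - 2.0 / (9.0 * p_obs)
    sd = math.sqrt(2.0 / (9.0 * p_obs))
    return (cube_root - mean) / sd

# Two cases with the same squared distance but different numbers of
# observed coordinates receive different scores.
print(wilson_hilferty_z(12.0, 3), wilson_hilferty_z(12.0, 10))
```

Sorting these scores and plotting them against standard normal quantiles
gives the probability plot described above.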

Little and Smith (1987) proposed, as starting point of the ER algorithm, the
MLE on the data where the missing values have been replaced by the median of
the corresponding observations. Although the ER algorithm is relatively simple
to implement, it suffers from an important drawback: its breakdown point is at
most $1/(p+1)$ because it is based on a weighting scheme that is not
redescending. This means that if the proportion of outliers exceeds this value
(or is even near it), the estimator is no longer robust. This drawback will be
highlighted by the simulation results.

To construct a high breakdown estimator of the mean and covariance matrix
in multivariate data when some observations are missing, Cheng and
Victoria-Feser (2002) propose two strategies. The first is to provide a high
breakdown estimator such as the MCD estimator as starting point for the ER
algorithm, and the second is to also adapt a high breakdown estimator such as
an S-estimator (Rousseeuw and Yohai, 1984) to incomplete data. The resulting
estimator, which is called the ERTBS, is then defined through (2.5) and (2.6)
with

$$w_i^{\mu} = w_i^{\eta}/p = w_i^{\delta}/d_i^2 = \psi(d_i)/d_i, \tag{2.10}$$

$d_i = d_{oi}(\mu, \Sigma)/k$, with $\mu$ and $\Sigma$ being the current values
of the high breakdown point estimator, $k$ a scaling factor defined through
$d_{(q)}$, where $d_{(q)}$ denotes the $q$th ordered distance (based on the
$d_{oi}(\mu, \Sigma)$), $q = \lfloor (n+p+1)/2 \rfloor$ with $\lfloor x \rfloor$
denoting the integer part of $x$, and

$$\psi(d; c, M) = \begin{cases} d, & 0 \le d < M \\ \left(1 - \left(\frac{d - M}{c}\right)^2\right)^2 d, & M \le d \le M + c \\ 0, & d > M + c. \end{cases}$$

This $\psi$ function defines the translated biweight S-estimator proposed by
Rocke (1996). The parameters $M$ and $c$ control the breakdown point
$\epsilon^*$ and the asymptotic rejection probability (ARP) $\alpha$ of the
ERTBS. The ARP can be interpreted as the probability for an estimator, in
large samples under a reference distribution, to give a null (or nearly null)
weight. $M$ and $c$ are found implicitly by

$$\epsilon^* \max_d \rho(d; c, M) = E_{\chi^2_p}\left[\rho(d; c, M)\right],$$

$$M + c = \sqrt{(\chi^2_p)^{-1}(1 - \alpha)},$$

where $\rho$ is the primitive of $\psi$ (see Rocke, 1996). The choices for
$\epsilon^*$ and $\alpha$ are to be made by the analyst. The former is the
suspected maximal amount of contaminated data, and for the latter Cheng and
Victoria-Feser (2002) propose choices between 0.1% and 1%.
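
To make the shape of the weighting scheme concrete, here is a small sketch of
the translated biweight $\psi$ and of the resulting weight $\psi(d)/d$ for the
mean. The middle branch follows the usual form of Rocke's (1996) translated
biweight, which is an assumption here because that branch is not legible in
the source; the function names and the values of $M$ and $c$ in the example
are illustrative and were not obtained by solving the two implicit equations.

```python
def translated_biweight_psi(d, M, c):
    """psi(d; c, M): identity up to M, smooth descent to 0 on [M, M + c]."""
    if d < 0:
        raise ValueError("distances must be non-negative")
    if d < M:
        return d
    if d <= M + c:
        u = (d - M) / c
        return d * (1.0 - u * u) ** 2
    return 0.0

def mean_weight(d, M, c):
    """Weight psi(d) / d: 1 for d < M, decreasing to 0 at d = M + c."""
    if d == 0:
        return 1.0
    return translated_biweight_psi(d, M, c) / d

for d in (1.0, 2.0, 2.75, 3.5, 5.0):
    print(d, mean_weight(d, M=2.0, c=1.5))
```

Observations with distance below $M$ keep full weight, and those beyond
$M + c$ are rejected outright, which is what makes the scheme redescending.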

As Rocke (1996) noted, it is very important to choose a good starting point
for any algorithm defining a high breakdown point estimator, otherwise the
latter can lose its high breakdown properties. For the ERTBS, Cheng and
Victoria-Feser (2002) therefore propose an adaptation of the MCD estimator as
a starting point, as well as an algorithm to compute it. However, to compute
the MCD one needs algorithms that are based on random starting subsamples.
This can lead to situations in which the MCD is very slow to compute, if not
impossible. Therefore, in the following section, we propose a fast algorithm
to compute the MCD by adapting the FAST-MCD of Rousseeuw and Van Driessen
(1999) and, as an even faster alternative, a modified version of the OGK
estimator adapted to the case of missing data, to be used as a starting point
for the ERTBS.



3. Starting Point Robust Estimators with Missing Data 

3.1. The Modified MCD 

The objective of the MCD estimator is to find h observations (out of n) whose 
covariance matrix has the lowest determinant. The MCD mean estimator is then 
the sample mean of those h points, and the MCD covariance estimator is their 
sample covariance matrix. To compute the MCD, one needs an algorithm for 
finding the best subset of h points, which usually involves the repeated computa- 
tion of the sample mean and covariance as well as Mahalanobis distances. When 
some observations are missing, Cheng and Victoria-Feser (2002) propose to use the 
EM algorithm to compute the sample means and covariances at all steps of the 
algorithm and to base the Mahalanobis distances on the observed part of the ob- 
servation as in (2.8). The latter are standardized by means of the Wilson-Hilferty 
transformation given in (2.9), so that one takes into account the unequal number 
of missing values for each observation. 

A choice needs to be made on $h$, and one way is to choose it such that
the MCD has the highest breakdown point. In this case, the minimal value of
$h$ is (Rousseeuw and Leroy, 1987) $h := \lfloor (n+p+1)/2 \rfloor$. But this
is also the choice that gives the largest efficiency loss. So when we suspect
that the sample is not heavily contaminated, we can reasonably choose a larger
proportion of points, say 75% or 80%, and take $h := \lfloor 0.75n \rfloor$ or
$h := \lfloor 0.80n \rfloor$.
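
The two ways of choosing $h$ can be written down in a few lines. The helper
below is a hypothetical convenience function, not part of any library.

```python
def mcd_subset_size(n, p, proportion=None):
    """Size h of the MCD subset.

    With proportion=None the highest-breakdown choice floor((n + p + 1) / 2)
    is returned; otherwise floor(proportion * n), e.g. proportion=0.75.
    """
    if proportion is None:
        return (n + p + 1) // 2
    return int(proportion * n)

print(mcd_subset_size(100, 10))        # highest breakdown
print(mcd_subset_size(100, 10, 0.75))  # more efficient, less robust
```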

To run the MCD, Cheng and Victoria-Feser (2002) adapt the forward search 
algorithm proposed by Atkinson (1993, 1994). However, more recently Rousseeuw 
and Van Driessen (1999) have proposed a new algorithm called FAST-MCD sup- 
posed to be even faster than the forward search algorithm and able to deal with 
very large data sets. In this paper, we propose to adapt it to compute the MCD 
when there are missing data. 

A key idea of the FAST-MCD algorithm is the fact that, starting from any
approximation to the MCD, it is possible to find an approximation with a lower
determinant. Indeed, Rousseeuw and Van Driessen (1999) observed that from a
subset $H_k$ of size $h$ on which $\mu$, $\Sigma$ and the Mahalanobis
distances are computed, one can create a subset $H_{k+1}$ by taking, among the
$n$ observations, the $h$ ones with the smallest Mahalanobis distances, with
the property that the determinant of $\Sigma$ based on $H_{k+1}$ is smaller or
equal. Each such step is called a C-step. The initial subset is created by
randomly choosing $p + 1$ observations, from which the Mahalanobis distances
are computed to order the $n$ observations. The first $h$ ones define the
initial subset $H_1$. If the determinant of $\Sigma$ based on the randomly
chosen $p + 1$ observations is null, one adds one randomly chosen observation
at a time until the determinant becomes positive. If for any subset $H_k$
there are missing values, we compute $\mu_k$ and $\Sigma_k$ with the EM
algorithm. The Mahalanobis distances are also changed as in (2.8) and
standardized using the Wilson-Hilferty transformation. The absolute value of
the latter is used to order the observations. In that case the initial subset
is created by randomly choosing $p + 1$ observations among the fully observed
ones. For more details, see Copt and Victoria-Feser (2003).
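
The concentration idea can be sketched for complete data as follows. This is
only an illustration of the C-step with a single random start: it omits the
missing-data adaptation (EM estimates and Wilson-Hilferty standardized
distances) as well as the multiple starts and nested subsampling of the full
FAST-MCD. The function names, the singularity threshold and the use of the
divisor-$h$ covariance are assumptions of this sketch.

```python
import numpy as np

def subset_det(X, subset):
    """Determinant of the covariance matrix of the rows in `subset`."""
    return np.linalg.det(np.cov(X[subset].T, bias=True))

def c_step(X, subset, h):
    """One C-step: refit on `subset`, keep the h closest observations."""
    mu = X[subset].mean(axis=0)
    S = np.cov(X[subset].T, bias=True)
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)
    return np.sort(np.argsort(d2)[:h])

def concentrate(X, h, rng, max_steps=100):
    """Iterate C-steps from a random (p + 1)-point start until no change."""
    n, p = X.shape
    order = rng.permutation(n)
    m = p + 1
    # Enlarge the start one point at a time while its covariance is singular.
    while subset_det(X, order[:m]) <= 1e-12 and m < n:
        m += 1
    subset = c_step(X, order[:m], h)
    dets = [subset_det(X, subset)]
    for _ in range(max_steps):
        new_subset = c_step(X, subset, h)
        dets.append(subset_det(X, new_subset))
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    return subset, dets

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:10] += 10.0                       # ten shifted observations
subset, dets = concentrate(X, h=75, rng=rng)
print(len(subset), dets)
```

The recorded determinants form a non-increasing sequence, which is the
property that makes repeated C-steps a cheap way of improving any candidate
subset.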

Through extensive simulations we compare, in Section 4, the forward search 
algorithm and the FAST-MCD algorithm for the computation of the MCD with 
missing data. 



3.2. The Modified OGK 



Maronna and Zamar (2002) base their OGK on the robust estimator for
covariances $\sigma_{jk}$ proposed by Gnanadesikan and Kettenring (1972),
which is very simple to compute. Indeed, the latter is defined for a pair of
random variables (i.e., $p = 2$) as

$$\frac{1}{4}\left(\sigma(Y_j + Y_k)^2 - \sigma(Y_j - Y_k)^2\right)$$

where $\sigma()$ is a standard deviation function applied to its argument. A
robust estimator for $\sigma_{jk}$ is obtained when $\sigma()$ is a robust
function. When $p > 2$, the covariance matrix $\Sigma$ is estimated by
replacing all its elements by the pairwise estimates. It is well known that
such an estimator may produce non positive definite matrices and that the
estimator is not affine equivariant. To overcome the lack of positive
definiteness, Maronna and Zamar (2002) propose an estimator defined by the
following four steps:

1. Let $D = \mathrm{diag}(\sigma(Y_j))_{j=1,\ldots,p}$ and define
   $x_i = D^{-1} y_i$, $i = 1, \ldots, n$, i.e., realizations from
   $X = (X_1, \ldots, X_p)$.

2. Compute the matrix $U = (u_{jk})$ with

$$u_{jk} = \begin{cases} \frac{1}{4}\left(\sigma(X_j + X_k)^2 - \sigma(X_j - X_k)^2\right), & j \ne k \\ 1, & j = k. \end{cases} \tag{3.1}$$

3. Decompose $U$ as $U = E \Lambda E^T$ with
   $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.

4. Define $z_i = E^T x_i$, i.e., realizations from $Z = (Z_1, \ldots, Z_p)$,
   and $A = DE$. The estimator of $\Sigma$ is $A \Gamma A^T$ with
   $\Gamma = \mathrm{diag}(\sigma(Z_j)^2)_{j=1,\ldots,p}$.

A location estimator for $\mu$ is given by $A\nu$ with
$\nu = (m(Z_j))_{j=1,\ldots,p}$, $m()$ being a (robust) mean function. The
procedure can be iterated by replacing $U$ in step 2 by $E \Gamma E^T$ until
convergence. For the choice of $\sigma()$ and $m()$, see Maronna and Zamar
(2002). The latter argue that to improve the efficiency of the OGK, one could
use it as a hard rejection tool, in that a reweighted estimator as in (2.1) is
used in which $\hat{y}_i = y_i$ $\forall i$ and
$w_i^{\mu} = w_i^{\eta} = w_i^{\delta} = w_i$ with

$$w_i = \begin{cases} 1, & (y_i - \mu_{ogk})^T \Sigma_{ogk}^{-1} (y_i - \mu_{ogk}) \le \chi^2_p(0.9) \\ 0, & \text{otherwise.} \end{cases}$$

The resulting estimator will be called the reweighted OGK (rOGK). Note that
this strategy is also used most of the time with the MCD, but with the
quantile 0.975 (instead of 0.9) of the $\chi^2_p$. We will call the resulting
estimator the rMCD. We have not derived the formal conditions for convergence
to a unique solution of the algorithm used to compute the rOGK, but in our
simulations we found that it converged in all our samples.
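
The four steps above translate almost line by line into code. The sketch
below covers a single pass on complete data, without iteration, reweighting or
missing-data imputation. As simple stand-ins for $\sigma()$ and $m()$ it uses
the normalized median absolute deviation and the median, which are assumptions
of this sketch; Maronna and Zamar (2002) discuss other choices. All function
names are hypothetical.

```python
import numpy as np

def mad_scale(x):
    """Normalized median absolute deviation (stand-in for sigma())."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def gk_cov(a, b, scale=mad_scale):
    """Gnanadesikan-Kettenring pairwise covariance estimate."""
    return 0.25 * (scale(a + b) ** 2 - scale(a - b) ** 2)

def ogk(Y, scale=mad_scale, loc=np.median):
    """One pass of the orthogonalized GK estimator (steps 1 to 4)."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    d = np.array([scale(Y[:, j]) for j in range(p)])      # step 1
    X = Y / d
    U = np.eye(p)                                         # step 2
    for j in range(p):
        for k in range(j):
            U[j, k] = U[k, j] = gk_cov(X[:, j], X[:, k], scale)
    _, E = np.linalg.eigh(U)                              # step 3
    Z = X @ E                                             # step 4: z_i = E^T x_i
    A = np.diag(d) @ E
    gamma = np.array([scale(Z[:, j]) ** 2 for j in range(p)])
    nu = np.array([loc(Z[:, j]) for j in range(p)])
    return A @ nu, A @ np.diag(gamma) @ A.T

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
mu_ogk, sigma_ogk = ogk(Y)
print(mu_ogk)
print(sigma_ogk)
```

Because the final matrix has the form $A \Gamma A^T$ with positive diagonal
$\Gamma$, it is positive definite by construction, unlike the matrix of raw
pairwise estimates.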

To extend the OGK or rOGK to the case of missing data, we propose to impute
the missing values by means of the $\hat{y}_i$ in (2.3) obtained by the EM
algorithm, i.e., with $\mu$ and $\Sigma$ estimated by (2.1) where all weights
are equal to 1. The reason is that the EM algorithm is very fast, and although
it leads to biased estimates of $\mu$ and $\Sigma$ and therefore of the
imputed values $\hat{y}_i$, we found through extensive simulation studies that
in practice the rOGK is not, or only very little, affected (see Section 4). In
future research, we will seek a better adaptation of the OGK to the case of
missing data.



4. Simulation Study 

4.1. The Design 

The model is the multivariate normal distribution $N(\mu, \Sigma)$. Because
the OGK is not affine equivariant, following Maronna and Zamar (2002) we
choose to simulate correlated data, i.e., $\Sigma = R(\rho)^2$, where
$R(\rho)$ has elements $\rho$ for $j \ne k$ and 1 for the others, with
$\rho = 0.2$. We also chose $\rho = 0$. The data were contaminated by
so-called shift outliers (Woodruff and Rocke, 1993), i.e., an
$\epsilon$-proportion was generated using
$N((\mu + \beta/\sqrt{2})\, e_p, \Sigma)$, where $e_p$ is a $p$-dimensional
vector of ones. We set $\beta = 1.6$ and $\epsilon = 0$, 0.02, 0.05 and 0.1.
The missing data, if any, were chosen randomly from the mixture of good and
bad data. We chose proportions miss = 0.1, 0.2 and 0.3. Table 1 shows the
different values for $n$ and $p$ used in the simulations.

Table 1. Values for n and p.

         p = 10    p = 20    p = 50
n =        50       100       200
n =       100       200       400
n =       500       500       600

Each robust estimator requires a decision on its initialization parameters.
For the MCD estimator, $h = \lfloor 0.6n \rfloor$ was chosen. For the OGK,
$c_1 = 4.5$ and $c_2 = 3$ were chosen (see Maronna and Zamar, 2002). For the
ERTBS estimator we chose for our simulations the breakdown point
$\epsilon^* = 0.3$ and the ARP $\alpha = 0.001$. All computational experiments
were done on an Athlon 1000 MHz with 512 MB of memory. The core of the program
was written in Fortran 77 and Splus was used as a front-end (to produce the
various graphics). For all combinations of parameters, 1000 samples were
generated.

4.2. Computational Times 

The computational time needed for the ER and the ERTBS depends essentially
on the choice of the starting point. Therefore, we report here the time (as a
function of the sample size $n$) needed to compute the rOGK or the rMCD, when
the latter is computed using the adapted FAST-MCD algorithm (rMCD/FAST) or
the forward search algorithm (rMCD/FWD). We chose the reweighted version of
the two starting point estimators because, as shown in Copt and Victoria-Feser
(2003), the non-reweighted versions can lead to biased estimates. For each
value of miss and $\epsilon$ and the values of Table 1, a time in seconds has
been computed. Figure 1 shows the results (on a log scale) for the datasets
with $\epsilon = 10\%$ and miss = 30% (for other combinations the results are
similar). We notice the following features. As expected, the rMCD/FWD is
slower than the rMCD/FAST, with an increasing difference as the sample size
increases. The rMCD/FAST can be up to 150 times faster than the rMCD/FWD.
When the rOGK is used as a starting point, the computational times decrease
drastically, sometimes by a factor of 18 compared with the rMCD/FAST. On the
other hand, the speed of the rMCD/FAST does not depend very much on the
sample size $n$, whereas that of the rOGK does quite substantially.

Figure 1. Log of mean time in seconds needed to compute the rOGK and the
rMCD by means of the forward search (FWD) algorithm and the FAST-MCD
algorithm, as a function of the sample size and the data dimension p. Panels:
rMCD/FAST, rOGK and rMCD/FWD; horizontal axis: number of observations.



4.3. Comparing Estimators 

The aim of this subsection is to study the robustness properties (bias versus
efficiency) of the different estimators proposed with incomplete data by means
of simulations. For the MCD, all calculations were made using the modified
FAST-MCD for missing data. It should be stressed that this exercise has not
been done in Cheng and Victoria-Feser (2002). Copt and Victoria-Feser (2003)
compare the behavior of the MCD, rMCD, OGK and rOGK. They conclude that both
the rMCD and rOGK behave nicely (in terms of bias) even with correlated data
($\Sigma = R(0.2)^2$), whereas the OGK estimates of the variances and
covariances can be biased with contaminated data.

The estimators we consider here are the final estimators, namely the MLE
computed via the EM algorithm (which is taken as a benchmark), the ER
algorithm with the MLE as starting point (ER/MLE), the ER algorithm with the
rMCD and rOGK as starting point (ER/rMCD and ER/rOGK), and the ERTBS
algorithm with the rMCD and rOGK as starting point (ERTBS/rMCD and
ERTBS/rOGK). The data were generated using the designs presented in Section
4.1. The percentage of missing observations and the sizes $n$ and $p$ do not
seem to have an influence on the behavior of the different estimators. The
influential factors are the percentage of contamination and, potentially, the
covariance structure when the rOGK is chosen as starting point. We therefore
chose the $N(0, R(0.2)^2)$ as data generating model. We use boxplots to
compare the estimators. They are built on the estimated biases of one of the
elements of the mean vector, one of the diagonal elements of the covariance
matrix and one of the off-diagonal elements of the covariance matrix. Only the
results for $\mu_1$, $\sigma_{11}$ and $\sigma_{12}$ are represented, since
the same pattern is found for the other parameters.

Figures 2 and 3 show the boxplots of the sampling distributions of the
final estimators using different amounts of data contamination. The MLE
clearly fails even if the contamination is small. It is the most efficient
estimator when there is no contamination, but the efficiency loss of the
robust estimators seems to be quite small. The ER/MLE breaks down at 5% of
data contamination. Finally, the ER/rMCD, ER/rOGK, ERTBS/rMCD and ERTBS/rOGK
are very robust and can withstand at least 10% of data contamination.

If we want to see a difference between the ER and ERTBS with the same
high breakdown starting point, we have to push the percentage of contamination
up to 30%. We have not fully covered such situations, since it is very
unlikely that someone will want to study data sets with such a percentage of
contamination. However, Copt and Victoria-Feser (2003) show a simulated
example with 30% of contaminated data in which the ER/rOGK (and also the
ER/rMCD) clearly fails to detect the outliers, whereas the ERTBS/rOGK is not
affected by their presence in the sample.

Figure 2. Sampling distribution of the final high breakdown estimators with
missing data for (a) 0% and (b) 2% of data contamination.



5. Conclusion 

In this paper, we have considered (efficient) high breakdown estimators of the
mean and covariance of a multivariate normal distribution with missing data.
The computational speed of the estimators depends on the computational speed
of their starting point. The fastest is the modified OGK, although its speed
depends on the sample size, which is not so strongly the case for the modified
MCD computed with the FAST-MCD algorithm. As for their performance in terms of
bias, efficiency and breakdown point, the conclusion is that a high breakdown
estimator is crucial as a starting point. Among the candidates, the (modified)
OGK can be biased when the data are correlated but the (modified) rOGK is not.
The ERTBS has a larger breakdown point than the ER. The EM is not robust and
the ER/EM has a very small breakdown point. Finally, under no contamination,
the EM is the most efficient estimator, but the ERTBS has a very small
efficiency loss.

Figure 3. Sampling distribution of the final high breakdown estimators with
missing data for (a) 5% and (b) 10% of data contamination.


Generalized d-fullness Technique for
Breakdown Point Study of the Trimmed
Likelihood Estimator with Application

R.B. Dimova and N.M. Neykov 



Abstract. The d-fullness technique of Vandev (1993) for the finite sample
breakdown point study of the Weighted Trimmed Likelihood Estimator is
extended. The proposed generalized d-fullness technique is illustrated on
the generalized logistic regression model.

Keywords. Breakdown point, d-fullness, trimmed likelihood estimator,
generalized logistic regression.



1. Introduction 

The classical Maximum Likelihood Estimator (MLE) can be very sensitive to
outliers in the data. In fact, even a single outlier can completely ruin the
MLE. To overcome this problem, many robust alternatives to the MLE have been
developed (see Atkinson and Riani, 2000; Bednarski and Clarke, 1993; Beran,
1982; Christmann, 1994; Field and Smith, 1994; Hampel et al., 1986; Huber,
1981; Hubert, 1997; Marazzi and Yohai, 2003; Markatou et al., 1997; Neykov and
Neytchev, 1990; Shane and Simonoff, 2001; Vandev and Neykov, 1993; Windham,
1995; Choi et al., 2000).

Hadi and Luceño (1997) and Vandev and Neykov (1998) introduced a robust
parametric modification of the MLE called the Weighted Trimmed Likelihood
(WTL) estimator. The basic idea behind the trimming in the proposed estimator
is the removal of those observations whose values would be highly unlikely to
occur if the fitted model were true. These authors showed that under
appropriate choices of the trimming parameter and the weights, the WTL
estimator reduces to the MLE, to the LMS and LTS estimators of Rousseeuw
(1984) in the case of normal regression, and to the MVE and MCD estimators of
multivariate location and scatter introduced by Rousseeuw (1985) in the
multivariate normal case. The $\sqrt{n}$-consistency of the WTL estimator is
derived by Cizek (2002).

Algorithms for TL estimation in the univariate case were developed by Hadi
and Luceño (1997), whereas Neykov and Müller (2002) proposed a FAST-TLE
algorithm in the framework of generalized linear models.

The breakdown point (BP) properties of the WTL estimator were studied
by Vandev and Neykov (1998) and Müller and Neykov (2003) using the d-fullness
technique of Vandev (1993). According to Vandev and Neykov (1998), a set
$F = \{f_1, \ldots, f_n\}$ of arbitrary functions
$f_i : \Theta \to \mathbb{R}^+$, $\Theta \subseteq \mathbb{R}^q$, is called
d-full if for every subset $J \subset \{1, \ldots, n\}$ of cardinality $d$
($|J| = d$) the function $g_J(\theta) = \max_{j \in J} f_j(\theta)$,
$\theta \in \Theta$, is subcompact. A function $g : \Theta \to \mathbb{R}$,
$\Theta \subseteq \mathbb{R}^q$, is called subcompact if its Lebesgue set
$L_g(C) = \{\theta \in \Theta : g(\theta) \le C\}$ is contained in a compact
set for every real constant $C$, as defined by Müller and Neykov (2003).

The requirement of d-fullness of the set $F$ is restrictive; more precisely,
the condition "for every real constant $C$" in the definition of a subcompact
function is not always satisfied. For instance, the corresponding sets $F$ of
log-likelihoods for mixtures of univariate/multivariate normal or binomial
distributions, to name but a few, are not d-full in the above sense.

In this paper a generalized d-fullness technique is proposed to study the BP 
of the WTL estimator for a wider class of functions containing the class of subcom- 
pact functions. Section 2 defines the concept of breakdown point and generalized 
d-fullness. This technique is illustrated for the generalized logistic regression model 
in Section 3. The proofs of the lemmas and propositions are given in the Appendix. 



2. Definitions and Generalized d-fullness Technique 

To aid the presentation we recall the replacement variant of the finite sample
BP given in Hampel et al. (1986), which is closely related to that introduced
by Donoho and Huber (1983). Let
$X = \{x_i \in \mathcal{X} \subseteq \mathbb{R}^p,\ \text{for } i = 1, \ldots, n\}$
be a sample of size $n$.

Definition 2.1. The BP of an estimator $T$ at $X$ is given by

$$\varepsilon_n^*(T) = \frac{1}{n} \max\{m : \sup_{X_m} \|T(X_m)\| < \infty\},$$

where $X_m$ is a sample obtained from $X$ by replacing any $m$ of the points in
$X$ by arbitrary values from $\mathcal{X}$, and $\|\cdot\|$ is the Euclidean
norm.

We now recall the definition of the Weighted Trimmed estimator given in
Vandev and Neykov (1998). Let $f : \mathcal{X} \times \Theta \to \mathbb{R}^+$,
where $\Theta \subseteq \mathbb{R}^q$ is an open set, and
$F = \{f_i(\theta) = f(x_i, \theta),\ \text{for } i = 1, \ldots, n\}$.

Definition 2.2. The Weighted Trimmed estimator is defined as

$$W_k := \arg\min_{\theta \in \Theta} \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(i)}(\theta), \tag{2.1}$$

where $f_{\nu(1)}(\theta) \le f_{\nu(2)}(\theta) \le \cdots \le f_{\nu(n)}(\theta)$
are the ordered values of the $f_i$ at $\theta$,
$\nu = (\nu(1), \ldots, \nu(n))$ is the corresponding permutation of the
indices, which depends on $\theta$, $k$ is the trimming parameter, and the
weights $w_i \ge 0$ for $i = 1, \ldots, n$ are associated with the functions
$f_i(\theta)$ and are such that $w_{\nu(k)} > 0$.
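
The objective in (2.1) is easy to evaluate for a given $\theta$: sort the
individual losses and add up the $k$ smallest, each with its own weight. The
sketch below does this for a toy normal location model with unit variance and
minimizes over a coarse grid; the function names, the unit weights, the data
and the grid search are illustrative assumptions and not the FAST-TLE
algorithm.

```python
import math

def weighted_trimmed_objective(losses, weights, k):
    """Sum of w_nu(i) * f_nu(i) over the k smallest losses f_i(theta)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sum(weights[i] * losses[i] for i in order[:k])

def normal_location_losses(xs, theta):
    """f_i(theta) = -log density of N(theta, 1) evaluated at x_i."""
    const = 0.5 * math.log(2.0 * math.pi)
    return [0.5 * (x - theta) ** 2 + const for x in xs]

def grid_minimizer(xs, k, grid):
    """Grid point with the smallest weighted trimmed objective."""
    w = [1.0] * len(xs)
    return min(grid, key=lambda t: weighted_trimmed_objective(
        normal_location_losses(xs, t), w, k))

xs = [-1.0, 0.0, 1.0, 100.0]               # one gross outlier
grid = [i / 10 for i in range(-50, 1101)]
print(grid_minimizer(xs, k=3, grid=grid))  # trimmed fit ignores the outlier
print(grid_minimizer(xs, k=4, grid=grid))  # k = n gives the ordinary MLE
```

With $k = n$ and unit weights the objective is the ordinary negative
log-likelihood, so the estimator reduces to the MLE (here the sample mean).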

The $W_k$ estimator is too general to be of practical use. However, many high
breakdown statistical estimators can be derived from it. For instance, let
$f_i(\theta) = g(|r_i(\theta)|)$ for $i = 1, \ldots, n$, where $g$ is a
continuous monotonic function such that $g(0) = 0$ and
$r_i(\theta) = y_i - x_i^T\theta$ is the $i$th linear regression residual,
generated by the observation $(y_i, x_i^T) \in \mathbb{R}^{p+1}$ and the
unknown parameter $\theta \in \Theta \subseteq \mathbb{R}^p$. Then the LMS,
LTS and LQS regression estimators of Rousseeuw (1984), the h-trimmed weighted
$L_q$ estimator of Müller (1995), and the rank-based estimator of Hössjer
(1994) can be obtained as special cases of the $W_k$ estimator. If
$x_i \in \mathcal{X}$ for $i = 1, \ldots, n$ are i.i.d. observations with
probability density function $\varphi(x, \theta)$, which depends on an unknown
parameter $\theta \in \Theta \subseteq \mathbb{R}^q$, and
$f(x_i, \theta) = -\log\varphi(x_i, \theta)$, then the $W_k$ estimator
coincides with the WTL$_k$ estimator proposed by Hadi and Luceño (1997) and
Vandev and Neykov (1998). In particular, when $\varphi(x, \theta)$ is the
multivariate normal density function, the $W_k$ reduces to the MVE and MCD
estimators of multivariate location and scatter considered by Rousseeuw (1985)
(see Vandev and Neykov, 1998). Vandev and Neykov (1998) prove that the BP of
the $W_k$ is not less than $(n - k)/n$ if the set $F$ is d-full, $n \ge 3d$
and $(n + d)/2 \le k \le n - d$. Müller and Neykov (2003) extend their result,
finding the lower bound of the BP without additional assumptions on $k$ and
$n$. Moreover, for the generalized linear models it was shown that the
fullness parameter $d$ is related to the quantity $\mathcal{N}(X)$ of Müller
(1995), where $X$ is the matrix of the explanatory variables, which may not be
in general position, as is the case when they come from a designed experiment
or are generated by qualitative factors.

We need the following notation in the presentation of the generalized
d-fullness technique.

Let $g$ be a function such that $g : \Theta \to \mathbb{R}$, let
$\partial\Theta$ be the set of the boundary points of $\Theta$, and let
$\Theta_\infty = \{\{\theta_k\} : \theta_k \in \Theta,\ \|\theta_k\| \to \infty\}$
be the set of all sequences whose norm tends to infinity. Then
$\underline{g}$ is defined as

$$\underline{g} = \begin{cases} \inf\limits_{\theta^* \in \partial\Theta}\ \liminf\limits_{\theta_k \to \theta^*} g(\theta_k), & \text{if } \Theta \text{ is bounded, or} \\ \inf\limits_{\substack{\theta^* \in \partial\Theta \\ \{\theta_k\} \in \Theta_\infty}} \liminf g(\theta_k), & \text{if } \Theta \text{ is unbounded.} \end{cases} \tag{2.2}$$

Let us introduce the following conditions:

A1. $F = \{f_i(\theta) \ge 0,\ i = 1, \ldots, n,\ \text{for } \theta \in \Theta\}$
    is a set of continuous functions;
A2. There exists $\theta_0 \in \Theta$ such that for every subset
    $J \subset \{1, \ldots, n\}$ of cardinality $d$,
    $c^{**} g_J(\theta_0) < C$, where
    $g_J(\theta) = \max_{j \in J} f_j(\theta)$,
    $C = \inf_J \underline{g}_J$, $c^{**} = c^* = \ldots$, and
    $w^* = \min\{w_j > 0,\ j = 1, \ldots, n\}$.

Remark 2.3. The class of d-full sets of functions is a special case of the
class of sets of functions satisfying conditions A1 and A2, because if
$\underline{g} = \infty$, then $g$ is a subcompact function. This follows from
Lemma 4.1 given in the Appendix.

The following proposition gives sufficient conditions under which there
exists a solution of the optimization problem (2.1).

Proposition 2.4. If $k \ge d$ and A1 and A2 hold, then $W_k$ is a non-empty compact set.

The next proposition gives a lower bound for the BP of $W_k$ for a set of functions
$F$ satisfying conditions A1 and A2. It is a generalization of the corresponding
result of Vandev and Neykov (1998), who required $d$-fullness of $F$.

Proposition 2.5. The BP of $W_k$ is not less than $\frac{1}{n}\min(n - k,\ k - d)$ if A1 and A2
hold.

Remark 2.6. The above propositions hold for the WTL$_k$ estimator. Thus, if one
wishes to study the BP of the WTL$_k$ estimators for a particular distribution, one
has to establish the validity of the conditions A1 and A2 for the corresponding set
of functions $f_i(\theta) = -\log \varphi(x_i, \theta)$ for $i = 1, \dots, n$. Proposition 2.5 then gives
the lower bound of the BP over the admissible range of values of $k$.
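The bound of Proposition 2.5 is easy to tabulate over the trimming parameter $k$. The following minimal sketch only illustrates the arithmetic of $\min(n - k,\ k - d)/n$; the function names `bp_lower_bound` and `best_k` are hypothetical and do not come from the paper.

```python
def bp_lower_bound(n: int, k: int, d: int) -> float:
    """Lower bound min(n - k, k - d) / n of Proposition 2.5 (illustrative helper)."""
    if not d <= k <= n:
        raise ValueError("expected d <= k <= n")
    return min(n - k, k - d) / n


def best_k(n: int, d: int) -> int:
    """Smallest k in {d, ..., n} that maximizes the lower bound."""
    return max(range(d, n + 1), key=lambda k: bp_lower_bound(n, k, d))


if __name__ == "__main__":
    n, d = 20, 3
    for k in range(d, n + 1):
        print(k, bp_lower_bound(n, k, d))
    print("maximizing k:", best_k(n, d))
```

The bound is largest when $n - k$ and $k - d$ are balanced, that is, for $k$ close to $(n + d)/2$.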

3. Application on a Generalized Logistic Regression Model 

As an illustration of the above propositions we consider the grouped binary linear
regression model with generalized logistic link. The data under consideration
are of the form $(y_i, x_i^T)$ for $i = 1, \dots, N$. It is assumed that $y_i$ is binomially
distributed, $b(y_i \mid n_i, \pi_i)$, where the group size is $n_i$, the probability of success is
$\pi_i$, and $x_i$ is a $p$-dimensional vector of covariates (explanatory variables). The total
number of observations is $n = n_1 + n_2 + \dots + n_N$. We will assume that $0 < y_i < n_i$
for each $i$, and $\pi_i$ follows the Prentice (1976) generalized logistic distribution

$$\pi_i = (1 + \exp(-\eta_i))^{-\alpha},$$

where $\alpha > 0$, $\eta_i = x_i^T \beta$ is the linear predictor and $\beta$ is a $p$-dimensional vector of
unknown parameters.

The particular case $\alpha = 1$ is considered by Müller and Neykov (2003),
who proved that the BP of the WTL$_k$ estimator is equal to $\min(N - k + 1,\ k - \mathcal{N}(X))/N$,
where $\mathcal{N}(X) = \max_{\beta \ne 0} \operatorname{card}\{i \in \{1, \dots, N\} : x_i^T \beta = 0\}$.

We will show that the set $F = \{f(y_i, \eta_i, \alpha),\ i = 1, \dots, N\}$, where

$$f(y_i, \eta_i, \alpha) = -\log\binom{n_i}{y_i} + y_i\,\alpha \log(1 + e^{-\eta_i}) - (n_i - y_i) \log\bigl(1 - (1 + e^{-\eta_i})^{-\alpha}\bigr),$$

satisfies Conditions A1 and A2.

It is obvious that $\lim_{\alpha \to 0} f(y_i, \eta_i, \alpha) = +\infty$, $\lim_{\alpha \to +\infty} f(y_i, \eta_i, \alpha) = +\infty$, and
$\lim_{\eta_i \to \pm\infty} f(y_i, \eta_i, \alpha) = +\infty$. Therefore $f(y_i, \eta_i, \alpha)$ is a subcompact function because
$\underline{f} = +\infty$.
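The behaviour of $f(y_i, \eta_i, \alpha)$ at the boundary of the parameter region can be checked numerically. The sketch below is an illustration only: it assumes that the leading term of $f$ is the log binomial coefficient, and the names `generalized_logistic` and `loss_term` are hypothetical.

```python
import math


def generalized_logistic(eta: float, alpha: float) -> float:
    """Prentice-type generalized logistic link: pi = (1 + exp(-eta)) ** (-alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (1.0 + math.exp(-eta)) ** (-alpha)


def loss_term(y: int, n_i: int, eta: float, alpha: float) -> float:
    """Negative binomial log-likelihood f(y, eta, alpha) of one group (0 < y < n_i)."""
    pi = generalized_logistic(eta, alpha)
    return (-math.log(math.comb(n_i, y))
            - y * math.log(pi)               # equals y * alpha * log(1 + exp(-eta))
            - (n_i - y) * math.log(1.0 - pi))


if __name__ == "__main__":
    # f grows when eta moves towards +/- infinity or alpha towards 0 or infinity.
    for eta, alpha in [(0.0, 1.0), (8.0, 1.0), (-8.0, 1.0), (0.0, 50.0), (0.0, 1e-3)]:
        print(eta, alpha, loss_term(1, 2, eta, alpha))
```

For $\alpha = 1$ the link reduces to the ordinary logistic function, which is the case treated by Müller and Neykov (2003).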

Proposition 3.1. The set $\{f(y_i, x_i, \beta, \alpha),\ i = 1, \dots, N\}$ is $(\mathcal{N}(X) + 1)$-full.

As a consequence of this proposition, the following corollary is obtained.

Corollary 3.2. The set WTL$_k$ for the grouped binary linear regression model with
generalized logistic link is a non-empty compact set if $k \ge \mathcal{N}(X) + 1$.

Applying Theorem 2 of Müller and Neykov (2003) we get the following

Corollary 3.3. If $\lfloor (N + 1)/2 \rfloor \le k \le \lfloor (N + \mathcal{N}(X) + 2)/2 \rfloor$, then the BP of
the WTL$_k$ estimator for the grouped binary linear regression model with generalized
logistic link is

$$\varepsilon^*(\mathrm{WTL}_k) \ge \frac{1}{N} \left\lfloor \frac{N - \mathcal{N}(X) + 1}{2} \right\rfloor .$$

Recall that $\lfloor z \rfloor := \max\{n \in \mathbb{Z} : n \le z\}$.



4. Appendix 

The proofs in this section are based on the following Lemma. 



Lemma 4.1. Let $g : \Theta \to \mathbb{R}$ be a continuous function and $\Theta \subset \mathbb{R}^q$ be an open set. If
there exist $\theta_0 \in \Theta$ and a real constant $a \ge 1$ such that $a\,g(\theta_0) < C$, where $C < \underline{g}$,
then the set $S = \{\theta : g(\theta) \le C\}$ is bounded and non-empty.

Proof. Note that the set $S$ is non-empty since $\theta_0 \in S$. Assume
that $S$ is unbounded, and let $\{\theta_j\}$ be a sequence from $S$ such that $\|\theta_j\| \to \infty$ as $j \to \infty$, with
$r_j = \max\{\|\theta_1\|, \dots, \|\theta_j\|\}$. Then we have that $\inf g(\theta) \le g(\theta_j) \le C$. Taking the
limit we get $\underline{g} \le C$, which contradicts $C < \underline{g}$.



In the proofs of the propositions we will use the representation
$f_{\nu(k)}(\theta) = \min_{I \in I_k} \max_{i \in I} f_i(\theta)$, which holds at any
fixed $\theta$, where $I_k$ is the set of all subsets of $\{1, \dots, n\}$ consisting of $k$ elements
(see Krivulin, 1992).
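The min-max representation of the $k$-th order statistic can be verified by brute force on a small example. The sketch below is only an illustration; the function names are hypothetical, and the list `values` stands for $f_1(\theta), \dots, f_n(\theta)$ evaluated at one fixed $\theta$.

```python
from itertools import combinations


def kth_smallest(values, k):
    """k-th order statistic (1-based) of the list of values."""
    return sorted(values)[k - 1]


def min_max_over_subsets(values, k):
    """min over all k-element index sets I of max_{i in I} values[i]."""
    indices = range(len(values))
    return min(max(values[i] for i in subset) for subset in combinations(indices, k))


if __name__ == "__main__":
    values = [3.0, 1.0, 4.0, 1.5, 9.0]
    for k in range(1, len(values) + 1):
        print(k, kth_smallest(values, k), min_max_over_subsets(values, k))
```

The minimum is attained by the index set of the $k$ smallest values, whose maximum is exactly the $k$-th order statistic.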



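Proof of Proposition 2.4. The following inclusions hold

$$
\begin{aligned}
W_k &= \Bigl\{\theta : \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(i)}(\theta) \le \inf_{\vartheta} \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(i)}(\vartheta)\Bigr\} \\
&\subset \Bigl\{\theta : \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(i)}(\theta) \le \inf_{\vartheta} \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(k)}(\vartheta)\Bigr\} \\
&\subset \Bigl\{\theta : \sum_{i=1}^{k} w_{\nu(i)} f_{\nu(i)}(\theta) \le c^{*} \inf_{\vartheta} f_{\nu(k)}(\vartheta)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \Bigl\{\theta : \sum_{i \in I} w_i f_i(\theta) \le c^{*} \inf_{\vartheta} f_{\nu(k)}(\vartheta)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \Bigl\{\theta : \sum_{i \in I} w_i f_i(\theta) \le c^{*} \inf_{\vartheta \in \Theta} \max_{i \in I} f_i(\vartheta)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \Bigl\{\theta : w^{*} \max_{i \in I} f_i(\theta) \le c^{*} \inf_{\vartheta \in \Theta} \max_{i \in I} f_i(\vartheta)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \Bigl\{\theta : \max_{i \in I} f_i(\theta) \le c^{**} \max_{i \in I} f_i(\theta_0)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \ \bigcup_{J \subset I,\ |J| = d} \Bigl\{\theta : \max_{j \in J} f_j(\theta) \le c^{**} \max_{j \in J} f_j(\theta_0)\Bigr\} \\
&\subset \bigcup_{I \in I_k} \ \bigcup_{J \subset I,\ |J| = d} \bigl\{\theta : g_J(\theta) \le C\bigr\}.
\end{aligned}
$$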

The latter set is a non-empty bounded set according to Lemma 4.1, since A2 
condition is satisfied. Therefore, Wk is a compact set, since the functions from F 
are continuous. 



Proof of Proposition 2.5. Let $\tilde{F} = \{\tilde{f}_i(\theta) = f(\tilde{x}_i, \theta) \text{ for } i = 1, \dots, n\}$ be
obtained from $F$ upon replacement of $m = \min\{n - k,\ k - d\}$ observations from
the sample $X$ with arbitrary ones from $\mathcal{X}$, let the weights $\tilde{w}_i$ correspond to the
functions that belong to $\tilde{F}$, and let $\tilde{W}_k$ be the corresponding analog of $W_k$, defined
over $\tilde{F}$. The number of the original functions in $\tilde{F}$ is $n - m \ge k$. Hence there
exists $I^* \in I_k$ such that $\tilde{f}_i(\theta) = f_i(\theta)$ for $i \in I^*$. The following inequalities
hold: $\tilde{f}_{\nu(k)}(\theta) = \min_{I \in I_k} \max_{i \in I} \tilde{f}_i(\theta) \le \max_{i \in I^*} f_i(\theta)$ and $f_{\nu(d)}(\theta) \le \tilde{f}_{\nu(k)}(\theta)$ (see Müller and
Neykov, 2003). Then we have the inclusions



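$$
\begin{aligned}
\tilde{W}_k &= \Bigl\{\theta : \sum_{i=1}^{k} \tilde{w}_{\nu(i)} \tilde{f}_{\nu(i)}(\theta) \le \inf_{\vartheta} \sum_{i=1}^{k} \tilde{w}_{\nu(i)} \tilde{f}_{\nu(i)}(\vartheta)\Bigr\} \\
&\subset \Bigl\{\theta : w^{*} \tilde{f}_{\nu(k)}(\theta) \le c^{*} \inf_{\vartheta} \tilde{f}_{\nu(k)}(\vartheta)\Bigr\} \\
&\subset \Bigl\{\theta : w^{*} \tilde{f}_{\nu(k)}(\theta) \le c^{*} \inf_{\vartheta} \max_{i \in I^*} f_i(\vartheta)\Bigr\} \\
&\subset \Bigl\{\theta : w^{*} \tilde{f}_{\nu(k)}(\theta) \le c^{*} \max_{i \in I^*} f_i(\theta_0)\Bigr\} \\
&\subset \bigcup_{J \subset I^*,\ |J| = d} \Bigl\{\theta : w^{*} \tilde{f}_{\nu(k)}(\theta) \le c^{*} \max_{j \in J} f_j(\theta_0)\Bigr\} \\
&\subset \bigcup_{J \subset I^*,\ |J| = d} \Bigl\{\theta : w^{*} f_{\nu(d)}(\theta) \le c^{*} \max_{j \in J} f_j(\theta_0)\Bigr\} \\
&\subset \bigcup_{J \subset I^*,\ |J| = d} \bigl\{\theta : f_{\nu(d)}(\theta) \le C\bigr\}.
\end{aligned}
$$

According to Lemma 4.1 the final set is a non-empty bounded set. Therefore, $\tilde{W}_k$
is a compact set, since the functions from $F$ are continuous.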

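Proof of Proposition 3.1. Let $C \in \mathbb{R}$ be arbitrary. Since $f(y_i, \eta_i, \alpha)$ is a subcompact
function of $\eta_i$ and $\alpha$, there exist constants $B_i$ and $A_i$ for $i \in I \subset \{1, \dots, N\}$,
$\operatorname{card}(I) = \mathcal{N}(X) + 1$, such that the set

$$
\begin{aligned}
&\Bigl\{\beta \in \mathbb{R}^p,\ \alpha > 0 : \max_{i \in I} f(y_i, x_i, \beta, \alpha) \le C\Bigr\} \\
&\quad = \bigcap_{i \in I} \bigl\{\beta \in \mathbb{R}^p,\ \alpha > 0 : f(y_i, x_i, \beta, \alpha) \le C\bigr\} \\
&\quad = \bigcap_{i \in I} \bigl\{\beta \in \mathbb{R}^p,\ \alpha > 0 : f(y_i, \eta_i = x_i^T \beta, \alpha) \le C\bigr\} \\
&\quad \subset \bigcap_{i \in I} \Bigl\{\bigl\{\beta \in \mathbb{R}^p : |x_i^T \beta| \le B_i\bigr\} \times \bigl\{\alpha : 0 < \alpha \le A_i\bigr\}\Bigr\}
\end{aligned}
$$

is contained in a bounded set. (The set $\bigcap_{i \in I} \{\beta \in \mathbb{R}^p : |x_i^T \beta| \le B_i\}$ is bounded for all $B_i$
according to Lemma 3 of Müller and Neykov, 2003.)



On Robustness to Outliers of Parametric $L_2$
Estimate Criterion in the Case of Bivariate
Normal Mixtures: a Simulation Study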

A. Durio and E.D. Isaia 

Abstract. The purpose of this work is to investigate the use of the Integrated
Square Error, or $L_2$ distance, as a practical tool for parameter estimation
of mixtures of two normal bivariates in the presence of outliers, situations
in which maximum likelihood estimators are usually unstable. Theory is outlined,
and closed expressions for the $L_2$ minimizing estimate criterion for bivariate
gaussian mixtures are given. In order to evaluate the robustness of the maximum
likelihood and $L_2$ minimizing estimate criteria we compare results arising from
a Monte Carlo simulation for some mixtures of gaussian bivariates in the presence
of different positions and numbers of outliers, matching some typical
situations that frequently arise in industrial and chemical fields.

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62H12. 
Keywords. Bivariate normal mixtures, integrated squared error, minimum dis- 
tance estimation, outliers detection, robust multivariate estimation. 



1. Introduction 

In this paper we investigate the use of the Integrated Square Error, or $L_2$
distance, as a practical tool for parameter estimation of mixtures of two normal
bivariates in the presence of outliers, situations in which Maximum Likelihood
Estimators (MLE) are usually unstable.

Following Scott (2001), theory is outlined and a closed expression for the $L_2$
minimizing estimate criterion ($L_2E$) for a mixture of two bivariate normal densities
is presented.

In order to evaluate the robustness of ML and $L_2$ estimates, the latter being by
nature less influenced by outliers (e.g., Basu et al., 1998; Durio and Isaia, 2003), we
compare the results, in terms of average and standard error, arising from a Monte
Carlo simulation for some specific cases of mixtures of two normal bivariates in
the presence of different positions and numbers of outliers. There are some real
situations in which a few outliers lie in specific positions; for instance, in
quality control the outliers are scarce and come from a sudden out-of-registration
state of the production process or of the measurement system. In this paper
we consider the outliers as realizations of rare events rather than observations
belonging to clusters not contemplated by the model. Therefore, in order to match
some typical situations that frequently arise in scientific areas, outliers have been
positioned in a deterministic way and their contamination percentage has been
kept quite low.

This research was supported partly by the MURST 60% funds, grants EDI.es.fin.02 and
AD.es.fin.02.



2. The $L_2$ Estimator

It is well known (e.g., Terrell, 1990) that the $L_2E$ criterion originates in the derivation
of the nonparametric least squares cross-validation algorithm to choose the
bandwidth for the kernel estimate of a density, e.g., Wand and Jones (1995). In
the parametric case a multivariate r.v. $X$ of dimension $p \ge 2$ is given, with probability
density $g(x|\theta_0)$ depending on the unknown parameter vector $\theta_0$, for which
we introduce the model $f(x|\theta)$. In order to obtain an estimate of $\theta_0$ we minimize the
$L_2$ distance with respect to $\theta$



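$$
\int_{\mathbb{R}^p} \bigl[f(x|\theta) - g(x|\theta_0)\bigr]^2 dx
= \int_{\mathbb{R}^p} f(x|\theta)^2\,dx - 2\,E\bigl[f(x|\theta)\bigr] + \int_{\mathbb{R}^p} g(x|\theta_0)^2\,dx .
\tag{2.1}
$$

In (2.1) the integral $\int_{\mathbb{R}^p} g(x|\theta_0)^2\,dx$ does not depend on $\theta$. So if we estimate
the expected height of the density $E[f(x|\theta)] = \int_{\mathbb{R}^p} f(x|\theta)\,g(x|\theta_0)\,dx$ with
$\hat{E}[f(x|\theta)] = n^{-1} \sum_{i=1}^{n} f(x_i|\theta)$, the proposed estimator for $\theta_0$ will be

$$
\hat{\theta}_{L_2E} = \arg\min_{\theta} \left[ \int_{\mathbb{R}^p} f(x|\theta)^2\,dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i|\theta) \right].
\tag{2.2}
$$

Thus, for instance, in the case of a multivariate gaussian density $\phi(x|\mu, \Sigma)$,
equation (2.2) becomes

$$
\hat{\theta}_{L_2E} = \arg\min_{\mu,\,\Sigma} \left[ \frac{1}{2^{p}\,\pi^{p/2}\,|\Sigma|^{1/2}} - \frac{2}{n} \sum_{i=1}^{n} \phi(x_i|\mu, \Sigma) \right]
$$

since $\int_{\mathbb{R}^p} \phi(x|\mu, \Sigma)^2\,dx = \phi(0|0, 2\Sigma)$, e.g., Scott (2001).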
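The identity $\int \phi(x|\mu, \Sigma)^2\,dx = \phi(0|0, 2\Sigma)$ can be checked numerically in the bivariate case. The sketch below is only an illustration using the Python standard library: the helper names, the midpoint-rule integration grid and the chosen parameter values ($\sigma = 1/3$, $\rho = -0.7$) are assumptions made for the example, not part of the paper.

```python
import math


def bvn_pdf(x, y, mean, cov):
    """Bivariate normal density with 2x2 covariance matrix `cov` at the point (x, y)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x - mean[0], y - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def integral_of_squared_density(mean, cov, half_width=3.0, steps=300):
    """Midpoint-rule approximation of the integral of phi^2 over a square around the mean."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = mean[0] - half_width + (i + 0.5) * h
        for j in range(steps):
            y = mean[1] - half_width + (j + 0.5) * h
            total += bvn_pdf(x, y, mean, cov) ** 2
    return total * h * h


sigma, rho = 1.0 / 3.0, -0.7
cov = [[sigma ** 2, rho * sigma ** 2], [rho * sigma ** 2, sigma ** 2]]
numeric = integral_of_squared_density((-1.0, -1.0), cov)
doubled = [[2.0 * v for v in row] for row in cov]
closed_form = bvn_pdf(0.0, 0.0, (0.0, 0.0), doubled)

if __name__ == "__main__":
    print(numeric, closed_form)
```

The closed form does not depend on $\mu$, which is why the integral term in the gaussian version of (2.2) involves $\Sigma$ only.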



Let us now consider a mixture of two bivariate gaussian densities (e.g., Titterington
et al., 1985)

$$
f(x|\mu_1, \mu_2, \Sigma_1, \Sigma_2) = w\,\phi(x|\mu_1, \Sigma_1) + (1 - w)\,\phi(x|\mu_2, \Sigma_2)
\tag{2.3}
$$

where the mixing probability $w$ satisfies $0 < w < 1$.

In this situation the integral that appears in (2.2) becomes

$$
\begin{aligned}
\int_{\mathbb{R}^2} f(x|\mu_1, \mu_2, \Sigma_1, \Sigma_2)^2\,dx
&= w^2 \int_{\mathbb{R}^2} \phi(x|\mu_1, \Sigma_1)^2\,dx
 + (1 - w)^2 \int_{\mathbb{R}^2} \phi(x|\mu_2, \Sigma_2)^2\,dx \\
&\quad + 2w(1 - w) \int_{\mathbb{R}^2} \phi(x|\mu_1, \Sigma_1)\,\phi(x|\mu_2, \Sigma_2)\,dx .
\end{aligned}
$$

If we now observe that

$$
\int_{\mathbb{R}^2} \phi(x|\mu_1, \Sigma_1)\,\phi(x|\mu_2, \Sigma_2)\,dx = \phi(0|\mu_1 - \mu_2, \Sigma_1 + \Sigma_2)
$$

it follows that

$$
\int_{\mathbb{R}^2} f(x|\mu_1, \mu_2, \Sigma_1, \Sigma_2)^2\,dx
= w^2\,\phi(0|0, 2\Sigma_1) + (1 - w)^2\,\phi(0|0, 2\Sigma_2) + 2w(1 - w)\,\phi(0|\mu_1 - \mu_2, \Sigma_1 + \Sigma_2)
$$

and hence equation (2.2) reduces to

$$
\begin{aligned}
\hat{\theta}_{L_2E} = \arg\min_{w,\,\mu_1,\,\mu_2,\,\Sigma_1,\,\Sigma_2} \Bigl[\,
& w^2\,\phi(0|0, 2\Sigma_1) + (1 - w)^2\,\phi(0|0, 2\Sigma_2) \\
& + 2w(1 - w)\,\phi(0|\mu_1 - \mu_2, \Sigma_1 + \Sigma_2)
 - \frac{2}{n} \sum_{i=1}^{n} f(x_i|\mu_1, \mu_2, \Sigma_1, \Sigma_2) \Bigr].
\end{aligned}
\tag{2.4}
$$

Since a closed-form solution of (2.4) is unlikely to be found, we have to
resort to numerical minimization algorithms, such as those based on quasi-Newton
optimization. It is important to recall that the convergence of any algorithm to
optimal solutions strongly depends on the initialization of the parameters.
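The objective function in (2.4) is cheap to evaluate, since every term is a bivariate normal density. The following sketch evaluates it with the Python standard library only; it is an illustration, not the authors' implementation. The function names, the random seed and the simulated sample (200 points from a two-component mixture with means $(-1,-1)$ and $(1,1)$ and standard deviations $1/3$) are assumptions, and no minimizer is included.

```python
import math
import random


def bvn_pdf(point, mean, cov):
    """Bivariate normal density with 2x2 covariance matrix `cov` at `point`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def l2e_criterion(data, w, mu1, cov1, mu2, cov2):
    """Objective of (2.4): closed-form integral of f^2 minus twice the mean height of f."""
    zero = (0.0, 0.0)
    twice1 = [[2.0 * v for v in row] for row in cov1]
    twice2 = [[2.0 * v for v in row] for row in cov2]
    summed = [[cov1[i][j] + cov2[i][j] for j in range(2)] for i in range(2)]
    diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
    integral = (w ** 2 * bvn_pdf(zero, zero, twice1)
                + (1.0 - w) ** 2 * bvn_pdf(zero, zero, twice2)
                + 2.0 * w * (1.0 - w) * bvn_pdf(zero, diff, summed))
    mean_height = sum(w * bvn_pdf(x, mu1, cov1) + (1.0 - w) * bvn_pdf(x, mu2, cov2)
                      for x in data) / len(data)
    return integral - 2.0 * mean_height


# Simulated sample (an assumption for the example): equal weights, independent components.
rng = random.Random(0)
s = 1.0 / 3.0
data = []
for _ in range(200):
    centre = -1.0 if rng.random() < 0.5 else 1.0
    data.append((rng.gauss(centre, s), rng.gauss(centre, s)))

cov = [[s ** 2, 0.0], [0.0, s ** 2]]
at_truth = l2e_criterion(data, 0.5, (-1.0, -1.0), cov, (1.0, 1.0), cov)
shifted = l2e_criterion(data, 0.5, (0.0, 0.0), cov, (2.0, 2.0), cov)

if __name__ == "__main__":
    print(at_truth, shifted)
```

In practice this function would be handed to a numerical minimizer started from suitable initial guesses, as discussed in the text; a mis-specified parameter vector such as the shifted one above gives a larger value of the criterion.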



3. The Simulation 

For our simulation study we decided to draw $k = 1{,}000$ samples of size $n = 200$
each from different mixtures of two bivariate gaussian densities, contaminated by $m$
points that, according to specific rules, may be considered as outliers. Since our
goal was to study the behavior of both ML and $L_2$ estimators in situations that
frequently arise in industrial and chemical fields, we chose not to position outliers
at random and to keep their contamination percentage quite low; in fact we set
$m = 8$ (16, 32).

We performed the simulation varying the distance between the components of
the mixture, assuming either independence of the components of both the bivariate
gaussian densities or dependence of the components of one of the bivariate normal
densities. Furthermore, in order to examine the influence of the mixing probability
on the estimates, we repeated the simulation with mixing probability $w = .5$ and
$w = .25$.

More precisely, referring to (2.3) and letting for $i = 1, 2$

$$
\mu_i = (\mu_{i1}, \mu_{i2})^T, \qquad
\Sigma_i = \begin{pmatrix} \sigma_{i1}^2 & \rho_i\,\sigma_{i1}\sigma_{i2} \\ \rho_i\,\sigma_{i1}\sigma_{i2} & \sigma_{i2}^2 \end{pmatrix}
$$

we considered the following eight experimental conditions

case   $w\,\phi(\mu_{11}, \mu_{12}, \sigma_{11}, \sigma_{12}, \rho_1) + (1 - w)\,\phi(\mu_{21}, \mu_{22}, \sigma_{21}, \sigma_{22}, \rho_2)$
a      $.5\,\phi(-1, -1, 1/3, 1/3, 0) + .5\,\phi(1, 1, 1/3, 1/3, 0)$
b      $.5\,\phi(-1, -1, 1/3, 1/3, -.7) + .5\,\phi(1, 1, 1/3, 1/3, 0)$
c      $.25\,\phi(-1, -1, 1/3, 1/3, 0) + .75\,\phi(1, 1, 1/3, 1/3, 0)$
d      $.25\,\phi(-1, -1, 1/3, 1/3, -.7) + .75\,\phi(1, 1, 1/3, 1/3, 0)$
e      $.5\,\phi(-1, -1, 2/3, 2/3, 0) + .5\,\phi(1, 1, 2/3, 2/3, 0)$
f      $.5\,\phi(-1, -1, 2/3, 2/3, -.7) + .5\,\phi(.5, .5, 2/3, 2/3, 0)$
g      $.25\,\phi(-1, -1, 2/3, 2/3, 0) + .75\,\phi(1, 1, 2/3, 2/3, 0)$
h      $.25\,\phi(-1, -1, 2/3, 2/3, -.7) + .75\,\phi(.5, .5, 2/3, 2/3, 0)$

contaminated by outliers according to the following rules

• cases (a) and (c), well-separated components and independence: we introduced
(e.g., Figure 1) $mw$ outliers equally spaced on the first quadrant of a circle centered
on $\mu_1$ and $m(1 - w)$ outliers equally spaced on the first quadrant of a circle centered
on $\mu_2$, both circles having radius $r = r_1 = r_2 = 1.00$ (1.17, 1.34). The circle
of radius 1.00 (1.17, 1.34) gives the region wherein the two gaussian bivariates fall
with common probability $p = 0.9889$ (0.9978, 0.9997).

• cases (b) and (d), well-separated components and dependence: we introduced
(e.g., Figure 1) $m(1 - w)$ outliers equally spaced on the first quadrant of a circle
centered on $\mu_2$, as we did in cases (a) and (c), with $r_2 = 1.00$ (1.17, 1.34), and
$mw$ outliers equally spaced on the first quadrant of an ellipse centered on $\mu_1$,
having length of the major axis $r_1 = 1.30$ (1.52, 1.75), so that the ellipse gives the
region wherein the first component of the mixture falls with the same probability
as the second component, i.e., $p = 0.9889$ (0.9978, 0.9997).

• cases (e) and (g), close components and independence: we introduced (e.g., Figure
2) $mw$ outliers equally spaced on the third quadrant of a circle centered on $\mu_1$
and $m(1 - w)$ outliers equally spaced on the first quadrant of a circle centered on
$\mu_2$. The circles have radius $r = r_1 = r_2 = 2.00$ (2.33, 2.69), so they correspond to the same
probability regions as in cases (a) and (b).

• cases (f) and (h), close components and dependence: we introduced (e.g., Figure
2) $m(1 - w)$ outliers equally spaced on the first quadrant of a circle centered on $\mu_2$
with $r_2 = 2.00$ (2.33, 2.69) and $mw$ outliers equally spaced on the third quadrant
of an ellipse centered on $\mu_1$ with major axis $r_1 = 2.61$ (3.04, 3.50), so they correspond to
the same probability regions as in the previous cases.

Figure 1. Experimental conditions of simulation with m = 4
outliers for any position considered (panels: Case a, Case b).
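The probabilities quoted for the circular regions can be reproduced from the standard result that, for a bivariate normal with independent components of common standard deviation $\sigma$, a circle of radius $r$ around the mean has probability $1 - \exp(-r^2/(2\sigma^2))$. The sketch below also places outliers on a quadrant of such a circle. It is an illustration only: the function names are hypothetical, and the exact spacing rule (angles at the midpoints of equal sub-arcs) is an assumption, since the text only says "equally spaced".

```python
import math


def circle_coverage(r: float, sigma: float) -> float:
    """P(||X - mu|| <= r) for a bivariate normal with covariance sigma^2 * I."""
    return 1.0 - math.exp(-r * r / (2.0 * sigma * sigma))


def quadrant_outliers(center, r, count, quadrant=1):
    """`count` points on the given quadrant (1..4) of the circle of radius r around `center`.

    Assumption: angles are the midpoints of `count` equal sub-arcs of the quadrant.
    """
    start = (quadrant - 1) * math.pi / 2.0
    step = (math.pi / 2.0) / count
    points = []
    for j in range(count):
        angle = start + (j + 0.5) * step
        points.append((center[0] + r * math.cos(angle), center[1] + r * math.sin(angle)))
    return points


if __name__ == "__main__":
    for r in (1.00, 1.17, 1.34):
        print(r, round(circle_coverage(r, 1.0 / 3.0), 4))
    print(quadrant_outliers((1.0, 1.0), 1.0, 4, quadrant=1))
```

With $\sigma = 1/3$ and $r = 1.00$, or equivalently $\sigma = 2/3$ and $r = 2.00$, the coverage is about 0.9889, which agrees with the value reported in the text.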

We chose cases (a), (c), (e) and (g) in order to study the behavior of $L_2E$
versus MLE under the assumption of independence of the components of both the
bivariate gaussian densities; in these situations the variance-covariance matrix of
equation (2.4) becomes, for $i = 1, 2$, $\Sigma_i = \operatorname{diag}(\sigma_i)$, where $\sigma_i = (\sigma_{i1}^2, \sigma_{i2}^2)^T$, that is, a
2 × 2 diagonal matrix, and $\hat{\theta}_{L_2E}$ is a random vector of dimension 9. In case (a), where the
components of the mixture are well separated, we were interested in the robustness of
MLE and $L_2E$ with respect to location and scale parameters, while in case (e),
where the two components of the mixture are close to each other, we focused our
attention on the estimation of the scale parameters.

Figure 2. Experimental conditions of simulation with m = 4
outliers for any position considered (panels: Case e, Case f, Case g, Case h).



In cases (b), (d), (f) and (h), where one of the bivariate gaussian densities
of the mixture has dependent components ($\rho \ne 0$), we were mainly interested in the
robustness of MLE and $L_2E$ with respect to the correlation parameter. We recall
that in these cases the random vector $\hat{\theta}_{L_2E}$ of equation (2.4) has dimension 11.

Comparing the results arising from cases (c), (d), (g) and (h) with the ones
from cases (a), (b), (e) and (f), we observed the influence of the mixing probability
on the estimates.

Before going on, it is worthwhile to underline that in order to obtain the
MLE of the parameters of the mixture (2.3) we maximized the likelihood function
resorting to numerical algorithms; the random vector of ML estimates is

$$
\hat{\theta}_{MLE} = \arg\max_{w,\,\mu_1,\,\mu_2,\,\Sigma_1,\,\Sigma_2} \prod_{i=1}^{n} \bigl[ w\,\phi(x_i|\mu_1, \Sigma_1) + (1 - w)\,\phi(x_i|\mu_2, \Sigma_2) \bigr].
\tag{3.1}
$$

We also resorted to the E-M algorithm (e.g., Flury, 1997; Gray, 1994), obtaining
results that are close to the ones of the direct maximization of (3.1). We chose
however the latter, since the E-M algorithm is more sensitive to the initial guesses.

All MLE and $L_2E$ computations were performed using the nlm routine implemented in
the R computing environment. The convergence of the algorithm to local optimal
solutions was (almost always) ensured by setting the vector of initial guesses equal
to the vector of true parameters.



4. Some Results 

For each of the $k = 1{,}000$ samples we estimated simultaneously the location, scale and
correlation parameters and the mixing probability according to both the ML and $L_2$ criteria.
The estimators were compared on the basis of the average
and standard error of the sample results.

At first glance, both the MLE and the $L_2E$ of the mixing probability
are accurate on average in all the experimental situations.

In all the cases we considered, the variability of the two estimators is of
the same order for location parameters, while for scale parameters and mixing
probability $L_2E$ systematically shows a somewhat greater variability than MLE.

Comparing the results arising from cases (a), (b), (e) and (f) with those from
cases (c), (d), (g) and (h), we observed that the mixing probability influences the
variability of the estimates of all parameters. As expected, for the first component
of the mixture (for which $w = .25$) the variability of the estimates is greater than
the one we observed in the cases for which $w = .5$.

In the following we report the results of the simulation, in terms of the average
of the two estimators, for cases (a), (b), (e) and (f), since the same considerations
could be made for cases (c), (d), (g) and (h).

• case (a): the $L_2$ criterion performs well in terms of location and
scale parameter estimation and seems to be less sensitive to outliers than the ML
criterion, which tends to overestimate the parameters as $m$ increases. Furthermore,
for a fixed $m$, the ML criterion tends to overestimate the parameters, and especially
the variances, as $r$ increases, while the $L_2$ estimates do not seem to be inflated by the
position of the outliers. As expected in this situation, the two estimators behave in
the same way with respect to the two components of the mixture. Table 1 reports
the means and the standard errors (in brackets) of the estimates arising from the
simulation for every $m$ and every $r$.



Table 1. Case (a): average of the estimates of the parameters of
the mixture (standard errors in brackets).

r     m   est.  μ11      μ12      σ11      σ12      μ21      μ22      σ21      σ22      w
1.00  8   L2E   -0.998   -1.001   0.335    0.335    0.999    1.003    0.337    0.339    0.499
                (0.031)  (0.033)  (0.029)  (0.027)  (0.035)  (0.035)  (0.030)  (0.030)  (0.015)
          MLE   -0.976   -0.977   0.353    0.353    1.023    1.026    0.352    0.354    0.500
                (0.031)  (0.033)  (0.021)  (0.021)  (0.033)  (0.032)  (0.022)  (0.022)  (0.007)
1.00  16  L2E   -0.999   -0.994   0.342    0.346    1.002    1.005    0.346    0.345    0.500
                (0.034)  (0.033)  (0.029)  (0.030)  (0.033)  (0.031)  (0.029)  (0.027)  (0.015)
          MLE   -0.953   -0.952   0.368    0.369    1.045    1.049    0.369    0.370    0.500
                (0.030)  (0.031)  (0.019)  (0.021)  (0.030)  (0.030)  (0.019)  (0.019)  (0.005)
1.00  32  L2E   -0.993   -0.993   0.363    0.363    1.009    1.012    0.364    0.363    0.499
                (0.033)  (0.033)  (0.034)  (0.036)  (0.034)  (0.034)  (0.037)  (0.032)  (0.012)
          MLE   -0.914   -0.914   0.396    0.396    1.087    1.088    0.395    0.395    0.500
                (0.029)  (0.028)  (0.019)  (0.019)  (0.029)  (0.030)  (0.019)  (0.017)  (0.004)
1.17  8   L2E   -1.001   -0.999   0.334    0.338    1.004    0.999    0.335    0.336    0.501
                (0.032)  (0.032)  (0.027)  (0.029)  (0.033)  (0.034)  (0.028)  (0.030)  (0.014)
          MLE   -0.972   -0.971   0.360    0.361    1.032    1.028    0.362    0.362    0.500
                (0.033)  (0.032)  (0.021)  (0.021)  (0.031)  (0.032)  (0.021)  (0.022)  (0.006)
1.17  16  L2E   -0.998   -1.000   0.342    0.340    0.999    1.000    0.339    0.344    0.500
                (0.033)  (0.032)  (0.028)  (0.029)  (0.033)  (0.033)  (0.031)  (0.031)  (0.015)
          MLE   -0.945   -0.946   0.386    0.385    1.055    1.055    0.386    0.386    0.500
                (0.030)  (0.030)  (0.019)  (0.019)  (0.029)  (0.030)  (0.020)  (0.020)  (0.008)
1.17  32  L2E   -0.998   -0.996   0.357    0.356    1.002    1.002    0.358    0.357    0.500
                (0.031)  (0.033)  (0.032)  (0.031)  (0.033)  (0.032)  (0.027)  (0.030)  (0.013)
          MLE   -0.898   -0.897   0.422    0.420    1.102    1.103    0.422    0.422    0.500
                (0.027)  (0.028)  (0.018)  (0.019)  (0.029)  (0.030)  (0.016)  (0.017)  (0.007)
1.34  8   L2E   -1.001   -1.001   0.333    0.334    0.998    1.004    0.336    0.337    0.499
                (0.034)  (0.034)  (0.027)  (0.027)  (0.033)  (0.035)  (0.030)  (0.028)  (0.013)
          MLE   -0.968   -0.968   0.371    0.372    1.031    1.034    0.375    0.376    0.499
                (0.035)  (0.033)  (0.020)  (0.021)  (0.033)  (0.035)  (0.023)  (0.022)  (0.009)
1.34  16  L2E   -1.002   -1.000   0.342    0.340    1.002    1.002    0.341    0.343    0.500
                (0.034)  (0.033)  (0.033)  (0.031)  (0.034)  (0.035)  (0.032)  (0.032)  (0.012)
          MLE   -0.939   -0.937   0.403    0.402    1.061    1.062    0.409    0.408    0.499
                (0.029)  (0.028)  (0.018)  (0.018)  (0.029)  (0.030)  (0.019)  (0.018)  (0.010)
1.34  32  L2E   -1.001   -0.998   0.359    0.356    1.003    0.997    0.353    0.354    0.502
                (0.034)  (0.035)  (0.035)  (0.031)  (0.027)  (0.026)  (0.023)  (0.035)  (0.011)
          MLE   -0.884   -0.884   0.452    0.451    1.116    1.114    0.455    0.456    0.499
                (0.037)  (0.036)  (0.022)  (0.023)  (0.025)  (0.025)  (0.021)  (0.021)  (0.015)

• case (e): the means of the ML estimates always overestimate the true parameters
and, as in case (a), this is due to the presence and positioning of outliers and also
to the fact that in this case the two components of the mixture are close to each
other. On average the $L_2$ criterion also overestimates the variances, but only mildly.
In fact the vector of $L_2E$ for $m = 32$ and $r = 2.69$ (the “worst” situation we
considered) is similar to the vector of MLE for $m = 8$ and $r = 2$ (the “best”
situation we considered). Figure 3 shows the contour plot of the mixture density
whose parameters are the means of the estimates of our simulation according to the $L_2$ and
ML criteria, for $m = 32$ and $r = 2.69$. We finally observe that in this case too the $L_2$
estimates do not seem to be inflated by the position of the outliers; in fact, for a
fixed $m$, we obtain the same estimates as $r$ increases. Conversely, for a fixed $m$,
ML estimates increase as $r$ increases.

Figure 3. Contour plots of case (e) with m = 32 and r = 2.69.

• case (b): for the second component of the mixture (for which $\rho_2 = 0$) both MLE
and $L_2E$ for location and scale parameters behave in the same way as outlined
for case (a). The estimates of $\rho_2$ are on average more accurate if we resort to the
$L_2$ criterion. The MLE for $\rho_2$ show a dependence between the components of the
bivariate gaussian density that increases as both $m$ and $r$ increase (they range
between .107 and .320).

For the first component of the mixture (for which $\rho_1 = -.70$) the ML criterion
tends to overestimate both location and scale parameters. In particular, see
Table 2, it seems that $\mu_{11}$ and $\sigma_{11}$ are more overestimated than $\mu_{12}$ and $\sigma_{12}$, but
this is probably due to the position of the outliers on the ellipse.

On the other hand, $L_2$ estimates of $\rho_1$ are on average very accurate, while
MLE shows an increasing independence between the components of the bivariate
normal density as both $m$ and $r$ increase.

Figure 4. Contour plots of case (f) with m = 32, r1 = 3.50 and
r2 = 2.69 (panels: MLE, L2E).



Table 2. Case (b): average of the estimates of the parameters of
the first component of the mixture (standard errors in brackets).

r1    m   est.  μ11      μ12      σ11      σ12      ρ1
1.30  8   L2E   -1.000   -0.996   0.333    0.331    -0.688
                (0.032)  (0.031)  (0.028)  (0.028)  (0.064)
          MLE   -0.982   -0.988   0.338    0.332    -0.627
                (0.032)  (0.032)  (0.023)  (0.022)  (0.055)
1.30  16  L2E   -0.996   -1.001   0.341    0.337    -0.675
                (0.033)  (0.031)  (0.029)  (0.028)  (0.074)
          MLE   -0.965   -0.980   0.349    0.335    -0.575
                (0.032)  (0.030)  (0.021)  (0.021)  (0.056)
1.30  32  L2E   -0.990   -0.995   0.352    0.346    -0.633
                (0.033)  (0.033)  (0.029)  (0.029)  (0.065)
          MLE   -0.935   -0.963   0.359    0.338    -0.499
                (0.030)  (0.030)  (0.020)  (0.020)  (0.061)
1.52  8   L2E   -0.997   -1.003   0.336    0.335    -0.701
                (0.033)  (0.034)  (0.029)  (0.031)  (0.068)
          MLE   -0.978   -0.988   0.346    0.336    -0.606
                (0.031)  (0.032)  (0.023)  (0.022)  (0.061)
1.52  16  L2E   -0.999   -1.001   0.339    0.341    -0.693
                (0.032)  (0.033)  (0.029)  (0.030)  (0.068)
          MLE   -0.959   -0.977   0.358    0.341    -0.541
                (0.031)  (0.032)  (0.023)  (0.022)  (0.061)
1.52  32  L2E   -0.999   -0.998   0.350    0.349    -0.679
                (0.033)  (0.034)  (0.029)  (0.031)  (0.067)
          MLE   -0.925   -0.955   0.376    0.348    -0.450
                (0.028)  (0.029)  (0.020)  (0.020)  (0.062)
1.75  8   L2E   -1.001   -0.998   0.336    0.335    -0.698
                (0.031)  (0.032)  (0.029)  (0.029)  (0.063)
          MLE   -0.977   -0.985   0.351    0.340    -0.583
                (0.031)  (0.032)  (0.022)  (0.023)  (0.060)
1.75  16  L2E   -1.001   -0.998   0.341    0.339    -0.693
                (0.033)  (0.034)  (0.030)  (0.028)  (0.066)
          MLE   -0.954   -0.973   0.369    0.347    -0.495
                (0.030)  (0.031)  (0.021)  (0.021)  (0.062)
1.75  32  L2E   -1.002   -0.998   0.350    0.351    -0.692
                (0.031)  (0.030)  (0.029)  (0.029)  (0.065)
          MLE   -0.915   -0.950   0.396    0.362    -0.392
                (0.028)  (0.028)  (0.020)  (0.020)  (0.063)

• case (f): in this situation some problems arise in estimating the vectors of
parameters, since the two components of the mixture are very close to each other
and for the first of them we set $\rho_1 = -.7$. The main problem in this case is finding
the right values of the initial guesses of the numerical algorithm for solving
equations (2.4) and (3.1). For all previous cases the true parameter vector was a good
choice for the initial guesses, but in this case it gives poorer results. In
our simulation study we chose the sample moments as initial guesses.

Both the ML and $L_2$ estimates of the location and scale parameters of the two
components of the mixture behave in general as in case (a). The estimates of $\rho_1$
and $\rho_2$ are more accurate if we resort to $L_2E$, whatever the number
and position of the outliers. This behavior, which does not differ from the one obtained in
case (b), is outlined in Figure 4.



5. Conclusions and Future Work

In this paper we outlined an approach to estimating the parameters of a mixture of
two bivariate gaussian densities based on the $L_2$ criterion. From a simulation study,
in which particular care was taken over the position and number of outliers so as
to match some typical situations that frequently arise in scientific areas (e.g.,
engineering, chemical, pharmaceutical), it follows that the $L_2$ approach generally
seems to be more robust than ML estimation over a wide range of data contamination.
Since our results are limited to the specific experimental conditions we
chose for the simulation, it would be interesting to study, in future work, the
behavior of the $L_2$ estimator when the fraction of outliers is increased and
the type of contamination is changed.

An interesting subject would be to explore the possibility of constructing
confidence intervals for the parameters using asymptotic theory or bootstrap techniques.

A further development would be, following the framework proposed by
Durio and Isaia (2003) in the case of regression modelling by $L_2$, the improvement
of a quick rule to identify and simultaneously quantify clusters in multivariate
data.

Since the $L_2$ estimate criterion helps to point out the presence of outliers
when evident discrepancies arise between $L_2E$ and MLE, we may exploit this
in order to implement a useful algorithm to determine the appropriate
number of components of a mixture.



Robust PCR and Robust PLSR:
a Comparative Study

S. Engelen, M. Hubert, K. Vanden Branden and S. Verboven 



Abstract. Principal Component Regression (PCR) and Partial Least Squares 
Regression (PLSR) are the two most popular regression techniques in chemo- 
metrics. They both fit a linear relationship between two sets of variables. The 
responses are usually low-dimensional whereas the regressors are very numer- 
ous compared to the number of observations. In this paper we compare two 
recent robust PCR and PLSR methods and their classical versions in terms 
of efficiency, goodness-of-fit, predictive power and robustness. 

Mathematics Subject Classification (2000). 62F35; 62H20; 92E99. 

Keywords. Principal component analysis, principal component regression, par- 
tial least squares, robustness. 



1. Introduction 

Principal Component Regression and Partial Least Squares Regression are the two
most popular regression techniques in chemometrics (Martens and Naes, 1998).
Both methods can handle multicollinearity in the data and can be applied if there
are more regressors than objects. Hence, they are the standard tools for multivariate
calibration, where the concentrations of certain constituents in samples are
modelled and predicted from their spectra.

PCR and PLSR are based on a bilinear model that explains the existence
of a relation between a set of $p$-dimensional regressors and a set of $q$-dimensional
response variables through $k$-dimensional scores $t_i$ with $k \ll p$. More precisely, for
$i = 1, \dots, n$ with $n$ the number of observations $(x_i, y_i)$, we assume that

$$x_i = \bar{x} + P_{p,k}\,t_i + f_i \tag{1.1}$$

$$y_i = \bar{y} + A'_{q,k}\,t_i + g_i. \tag{1.2}$$

Here $\bar{x}$ is the mean of the $x$-variables, $\bar{y}$ the mean of the $y$-variables, $P_{p,k}$ the
matrix of $x$-loadings and $A_{k,q}$ represents the slope matrix in the regression of
$y_i$ on $t_i$. The superscript $'$ is used for the transpose of a vector or matrix, and
subscripts as in $P_{p,k}$ indicate that the matrix $P$ has $p$ rows and $k$ columns. The
error terms are denoted by $f_i$ and $g_i$. In terms of the original predictor variables,
this bilinear model can be written as

$$y_i = \beta_0 + B'_{q,p}\,x_i + e_i$$

with

$$\beta_0 = \bar{y} - B'_{q,p}\,\bar{x}. \tag{1.4}$$

According to the bilinear model, both PCR and PLSR proceed in two major
stages. In the first step, following (1.1), they summarize the high-dimensional
observations $x_i$ in scores $t_i$ of dimension $k \ll p$. The selection of the number of
components $k$ can be done by means of various criteria, which will be
discussed in Section 3. These $k$ latent variables then become the regressors in the
second step of the algorithm (1.2). Finally, estimates for $B$ and $\beta_0$ are obtained
via (1.3) and (1.4).

The main difference between PCR and PLSR lies in the construction of the 
scores $t_i$. In PCR the scores are obtained by extracting the most relevant information
present in the $x$-variables by performing a principal component analysis on the
predictor variables and thus using a variance criterion. No information concerning
the response variables is yet taken into account. In contrast, the PLSR scores are
calculated by maximizing a covariance criterion between the $x$- and $y$-variables.
Hence information present in the responses is also used in the first stage of the
algorithm. By construction we thus expect that PLSR requires fewer components
than PCR. This has been confirmed by several studies: De Jong (1993b), Frank
and Friedman (1993), Martens and Naes (1998).

In this paper we want to investigate how recent robust methods of PCR and 
SIMPLS behave with respect to each other in terms of efficiency, goodness-of-fit, 
predictive ability and robustness. In Section 2 we briefly describe the four methods
involved in this paper: classical PCR (CPCR), robust PCR (RPCR), SIMPLS and 
robust SIMPLS (RSIMPLS). In Section 3 a comparison between the four methods 
is given on the basis of a simulation study. All four methods will be illustrated on 
a real example in Section 4 whereas Section 5 summarizes our conclusions. 



2. Robust Calibration Methods 

2.1. Principal Component Regression 

As explained in the introduction, CPCR starts by performing a Principal Component
Analysis (PCA) on the $x$-variables. The PCA loading matrix $P_{p,k}$ then
contains the first $k$ dominant eigenvectors of the empirical covariance matrix of
the $x_i$, and the scores satisfy $t_i = P'_{k,p}(x_i - \bar{x})$. In the second step of CPCR
multiple linear least squares regression (MLR) is applied on the $(t_i, y_i)$ to obtain an
estimate of the slope matrix $A_{k,q}$ in (1.2).

In Hubert and Verboven (2003) a robust PCR method is proposed by robustifying
both steps of CPCR. First a robust PCA method is applied on the
regressors. For low-dimensional data ($p < n/2$), the MCD estimator (Rousseeuw,
1984) is used as a robust estimator of the covariance matrix of the $x_i$, and for
high-dimensional data the ROBPCA method (Hubert et al., 2004). This estimator
combines projection pursuit techniques with robust covariance estimation in
low dimensions. Next a robust regression method is applied. If there is only one
response variable the reweighted LTS regression (Rousseeuw, 1984) is preferred,
else the MCD regression (Rousseeuw et al., 2004) is performed.
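
The classical two-step procedure described at the start of this subsection can be
sketched in Python as follows. This is only an illustration of CPCR, not of the robust
variant and not the authors' Matlab implementation; the function name `cpcr_fit` and
the use of NumPy are assumptions made for the example.

```python
import numpy as np

def cpcr_fit(X, y, k):
    """Classical PCR sketch: PCA on X, then least squares of y on k scores."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean(axis=0)
    Xc = X - x_mean
    # The right singular vectors of the centered X are the eigenvectors of the
    # empirical covariance matrix, ordered by decreasing eigenvalue.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                       # p x k loading matrix
    T = Xc @ P                         # scores t_i = P'(x_i - x_mean)
    # Multiple linear regression of the centered responses on the scores.
    A, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    B = P @ A                          # slope for the original regressors, cf. (1.3)
    beta0 = y_mean - x_mean @ B        # intercept, cf. (1.4)
    return beta0, B
```

Predictions for new observations are then obtained as `beta0 + X_new @ B`.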

2.2. Partial Least Squares Regression 

We consider the SIMPLS algorithm (De Jong, 1993a), the leading PLSR
method because of its speed and efficiency. Let $X$ and $Y$ be the centered data
matrices, with rows $(x_i - \bar{x})'$ and $(y_i - \bar{y})'$ for $i = 1, \ldots, n$. The first normalized
PLSR weight vectors $r_1$ and $q_1$ define the linear combinations of $X$ and $Y$ that maximize

$$\operatorname{cov}(X r_1, Y q_1).$$

The solution of this maximization problem is found by taking $r_1$ and $q_1$ as the
first left and right singular vectors of $S_{xy} = X'Y/(n-1)$, the cross-covariance
matrix of the $x$- and $y$-variables. For each observation the first coordinate of the
score $t_i$ is computed as $t_{i1} = (x_i - \bar{x})' r_1$.

The other PLSR weight vectors $r_a$ and $q_a$ for $a = 2, \ldots, k$ are obtained
by imposing an orthogonality constraint on the elements of the scores. If we require
that $\sum_{i=1}^{n} t_{ia} t_{ib} = 0$ for $a \neq b$, a deflation of the cross-covariance matrix
$S_{xy}$ provides the solutions for the other PLSR weight vectors. This deflation is
carried out by first calculating the $x$-loading $p_a = S_x r_a / (r_a' S_x r_a)$ with $S_x$ the
empirical variance-covariance matrix of the $x$-variables. Next an orthonormal base
$\{v_1, \ldots, v_a\}$ of $\{p_1, \ldots, p_a\}$ is constructed and $S_{xy}$ is deflated as

$$S_{xy}^{a} = S_{xy}^{a-1} - v_a (v_a' S_{xy}^{a-1})$$

with $S_{xy}^{1} = S_{xy}$. In general the PLSR weight vectors $r_a$ and $q_a$ are obtained as the
left and right singular vectors of $S_{xy}^{a}$. Finally, when the scores are $k$-dimensional,
MLR is performed of the responses $y_i$ on these scores $t_i$.
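
The SIMPLS steps above (leading singular vector, $x$-loading, orthonormalisation,
deflation, final regression) can be sketched as below. This is a minimal classical
version written for illustration, assuming NumPy; the name `simpls_fit` and its return
values are hypothetical and do not reproduce the Matlab library mentioned later.

```python
import numpy as np

def simpls_fit(X, Y, k):
    """SIMPLS sketch: weights from a deflated cross-covariance, then MLR."""
    n, p = X.shape
    Y = np.asarray(Y, dtype=float).reshape(n, -1)   # allow a 1-D response
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    Sx = Xc.T @ Xc / (n - 1)             # empirical covariance of x
    Sxy = Xc.T @ Yc / (n - 1)            # cross-covariance, p x q
    R = np.zeros((p, k))                 # weight vectors r_a
    V = np.zeros((p, k))                 # orthonormal base of the x-loadings
    for a in range(k):
        U, _, _ = np.linalg.svd(Sxy, full_matrices=False)
        r = U[:, 0]                      # first left singular vector
        load = Sx @ r / (r @ Sx @ r)     # x-loading p_a
        v = load - V[:, :a] @ (V[:, :a].T @ load)   # Gram-Schmidt step
        v /= np.linalg.norm(v)
        Sxy = Sxy - np.outer(v, v @ Sxy)            # deflation
        R[:, a], V[:, a] = r, v
    T = Xc @ R                           # n x k scores
    A, *_ = np.linalg.lstsq(T, Yc, rcond=None)      # MLR of responses on scores
    B = R @ A                            # p x q slope
    beta0 = y_mean - x_mean @ B          # intercept, length q
    return beta0, B, T
```

The returned score columns are mutually orthogonal, which is the constraint that the
deflation is meant to enforce.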

A robust method RSIMPLS has recently been developed in Hubert and Vanden
Branden (2003). It starts by applying ROBPCA (Hubert et al., 2004) on the
$x$- and $y$-variables in order to replace $S_{xy}$ and $S_x$ by robust estimates and then
proceeds analogously to the SIMPLS algorithm. Similar to RPCR a robust regression
method (ROBPCA regression) is performed in the second stage. In Vanden
Branden and Hubert (2004) it is proved that for low-dimensional data the RSIMPLS
approach yields bounded influence functions for the weight vectors $r_a$ and
$q_a$, and for the regression estimates. Also the breakdown value is inherited from
the MCD estimator.

The computational complexity of ROBPCA and RSIMPLS is discussed in detail
in Hubert et al. (2004) and Hubert and Vanden Branden (2003). The computation
time remains feasible due to the FAST-MCD algorithm (Rousseeuw and Van
Driessen, 1999). To give an example, on a Pentium IV with 1.60 GHz, a full Matlab
implementation requires approximately 7 seconds to perform RSIMPLS
on a data set with $n = 100$, $p = 500$ and $k = 5$.



3. Experimental Study 

We will compare the efficiency, the goodness-of-fit (GOF), the predictive power 
and the robustness of CPCR, RPCR, SIMPLS and RSIMPLS by performing a 
simulation study on uncontaminated and contaminated data. Note that the ro- 
bustness of RPCR and RSIMPLS has also been shown through simulations in 
Hubert and Verboven (2003) and Hubert and Vanden Branden (2003), but there 
the emphasis was put only on the parameter estimation and not on the predictive 
performance of the methods. The experiments described in this section consider 
univariate responses ($q = 1$), which are mostly used in practice. We thus consider
the regression model

$$y_i = \beta_0 + B'_{1,p} x_i + e_i$$

with $B_{p,1} = (\beta_1, \ldots, \beta_p)'$. The regression vector including the intercept is denoted
as $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$.

3.1. Simulation Settings 

We compare the algorithms on high-dimensional data sets $X_{50,100}$ of size $n = 50$
and $p = 100$ and low-dimensional data sets $X_{100,6}$ of size $n = 100$ and $p = 6$. They
were constructed according to the bilinear model (1.1) and (1.2):

$$T \sim N_2(0_2, \Sigma_t)$$

$$X = T I_{2,p} + N_p(0_p, 0.1 I_p)$$

$$Y = T A_{2,1} + N(0, 1)$$

with $0_2 = (0, 0)'$, $(I_{k,p})_{ij} = \delta_{ij}$ and $A_{2,1} = (1, 1)'$. Furthermore, a different
$2 \times 2$ covariance matrix $\Sigma_t$ was set for $p = 100$ and for $p = 6$. Hence the optimal
number of components is $k_{opt} = 2$.

Next, contamination is added by replacing 10% of the observations by different
types of outliers. Denote $T_\varepsilon$, $X_\varepsilon$ and $Y_\varepsilon$ as the contaminated parts of the
data.

1. Bad leverage points were constructed by substituting $T_\varepsilon \sim N_2((15, 15)', \Sigma_t)$
in equation (1.1):

$$X_\varepsilon = T_\varepsilon I_{2,p} + N_p(0_p, 0.1 I_p).$$

Note that the corresponding $y$-values did not change.








2. Vertical outliers have uncontaminated $x$-values, but their $y$-values were
changed by adjusting the error term in (1.2):

$$Y_\varepsilon = T A_{2,1} + N(15, 0.1).$$

For each situation, $m = 100$ data sets were generated and they were analyzed with
$k = 1$, 2 and 3 components.
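
The simulation design can be mimicked with the following sketch. Several choices here
are assumptions rather than settings taken from the study: the covariance matrix
`sigma_t` must be supplied by the user, the second argument of each normal
distribution is read as a variance, and the function name `simulate`, the seed and
the placement of the outliers in the first rows are purely illustrative.

```python
import numpy as np

def simulate(n, p, sigma_t, contamination=None, eps=0.10, seed=0):
    """Generate one data set from the bilinear model, optionally contaminated."""
    rng = np.random.default_rng(seed)
    ones2 = np.ones(2)                        # A_{2,1} = (1, 1)'
    I2p = np.eye(2, p)                        # (I_{2,p})_{ij} = delta_ij
    sd_x = np.sqrt(0.1)                       # assumes 0.1 is a variance
    T = rng.multivariate_normal(np.zeros(2), sigma_t, size=n)
    X = T @ I2p + rng.normal(0.0, sd_x, size=(n, p))
    y = T @ ones2 + rng.normal(0.0, 1.0, size=n)
    outlier = np.zeros(n, dtype=bool)
    if contamination is None:
        return X, y, outlier
    n_out = int(round(eps * n))
    outlier[:n_out] = True                    # illustrative: first rows replaced
    if contamination == "bad_leverage":
        # New scores centred at (15, 15)'; the y-values are left unchanged.
        T_eps = rng.multivariate_normal([15.0, 15.0], sigma_t, size=n_out)
        X[:n_out] = T_eps @ I2p + rng.normal(0.0, sd_x, size=(n_out, p))
    elif contamination == "vertical":
        # Same x-values; the error term of y is shifted to mean 15.
        y[:n_out] = T[:n_out] @ ones2 + rng.normal(15.0, sd_x, size=n_out)
    else:
        raise ValueError("contamination must be None, 'bad_leverage' or 'vertical'")
    return X, y, outlier
```

For example, `simulate(50, 100, np.diag([4.0, 2.0]), "vertical")` produces one
high-dimensional data set with vertical outliers, where the diagonal matrix is a
placeholder for the unspecified $\Sigma_t$.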

The efficiency of the considered methods is evaluated by means of the MSE
of the estimated regression parameters $\hat{\beta}$. It is defined by

$$\mathrm{MSE}_k(\hat{\beta}) = \frac{1}{m} \sum_{l=1}^{m} \| \hat{\beta}_k^{(l)} - \beta \|^2$$

where $\hat{\beta}_k^{(l)}$ denotes the estimated parameter based on $k$ components in the $l$th
simulation. The MSE indicates to what extent the slope and intercept are correctly
estimated. So the goal is to obtain an MSE value close to zero.

Next we want to study how well the methods fit the regular data points.
Because of the simulation settings, we know exactly their indices, which we store
in the set $G_r$. Then we define the goodness-of-fit criterion as

$$\mathrm{GOF}_k = 1 - \frac{\operatorname{var}_{i \in G_r}(r_{i,k})}{\operatorname{var}_{i \in G_r}(y_i)}, \tag{3.1}$$

with $r_{i,k}$ the residual of the $i$th observation when $k$ components are computed.
The objective is to obtain a GOF close to 1. Note that this GOF is an adaptation
of the robust $R^2$ proposed in Hubert and Verboven (2003) where $G_r$ contains the
non-outlying data points detected by the estimation procedure itself (RPCR or
RSIMPLS). The value of $k$ where the $R^2$-curve stabilizes is then selected as the
optimal one.

Finally, we measure the predictive ability of the methods by means of the
Root Mean Squared Error (RMSE) as in Hubert and Vanden Branden (2003).
First we generate a test set $G_t$ of uncontaminated points with size $n_t = 50$, and
then we compute

$$\mathrm{RMSE}_k = \sqrt{\frac{1}{n_t} \sum_{i=1}^{n_t} (y_i - \hat{y}_{i,k})^2} \tag{3.2}$$

with $\hat{y}_{i,k}$ the predicted $y$-value of observation $i$ from the test set when the regression
parameter estimates are based on the training set $(X, Y)$ of size $n$ and $k$ scores are
retained. The optimal number of components is often selected as that $k$ for which
this RMSE value (or its cross-validated version) is minimal.
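
The three evaluation criteria can be computed with a few lines of NumPy. The sketch
below assumes that the MSE uses the squared Euclidean norm of the estimation error
and that the variances in the GOF are sample variances; the function names are
hypothetical.

```python
import numpy as np

def mse_beta(beta_hats, beta_true):
    """Mean over simulations of the squared distance between estimate and truth."""
    beta_hats = np.asarray(beta_hats, dtype=float)   # shape (m, p + 1)
    return np.mean(np.sum((beta_hats - beta_true) ** 2, axis=1))

def gof(y, y_fit, regular):
    """Goodness-of-fit (3.1), computed on the regular observations only."""
    residuals = (y - y_fit)[regular]
    return 1.0 - np.var(residuals, ddof=1) / np.var(y[regular], ddof=1)

def rmse(y_test, y_pred):
    """Root mean squared prediction error (3.2) on an uncontaminated test set."""
    return np.sqrt(np.mean((y_test - y_pred) ** 2))
```

Here `regular` is a boolean mask marking the indices in $G_r$.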

The results of the simulations are listed in Tables 1-3 for the high-dimensional
data sets and in Tables 4-6 for the low-dimensional situation.

Table 1. n = 50, p = 100, no contamination.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     1.12    1.14     0.63     0.65
        GOF        0.69    0.69     0.81     0.80
        RMSE       1.81    1.82     1.50     1.51
k = 2   MSE(β)     0.16    0.22     0.23     0.29
        GOF        0.89    0.89     0.91     0.90
        RMSE       1.14    1.15     1.13     1.15
k = 3   MSE(β)     0.20    0.29     3.19     0.91
        GOF        0.89    0.89     0.97     0.91
        RMSE       1.14    1.16     1.27     1.19

Table 2. n = 50, p = 100, 10% bad leverage points.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     1.93    1.14     1.92     0.65
        GOF        0.17    0.71     0.31     0.82
        RMSE       3.07    1.84     2.85     1.50
k = 2   MSE(β)     2.49    0.22     2.80     0.31
        GOF        0.39    0.89     0.45     0.91
        RMSE       2.68    1.15     2.68     1.16
k = 3   MSE(β)     2.78    0.28    20.24     1.02
        GOF        0.41    0.89     0.80     0.93
        RMSE       2.68    1.15     3.01     1.19



Table 3. n = 50, p = 100, 10% vertical outliers.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     3.39    1.15     3.06     0.62
        GOF        0.65    0.69     0.76     0.81
        RMSE       2.44    1.81     2.25     1.48
k = 2   MSE(β)     2.68    0.22    14.97     0.29
        GOF        0.80    0.89     0.66     0.90
        RMSE       2.11    1.15     2.55     1.15
k = 3   MSE(β)     3.37    0.28    60.56     1.01
        GOF        0.77    0.89     0.57     0.92
        RMSE       2.13    1.16     3.26     1.20


Table 4. n = 100, p = 6, no contamination.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     1.06    1.11     0.27     0.27
        GOF        0.55    0.54     0.77     0.77
        RMSE       1.79    1.81     1.28     1.27
k = 2   MSE(β)     0.02    0.03     0.03     0.05
        GOF        0.83    0.83     0.83     0.83
        RMSE       1.10    1.11     1.10     1.11
k = 3   MSE(β)     0.10    0.13     0.51     0.60
        GOF        0.83    0.83     0.84     0.84
        RMSE       1.11    1.11     1.12     1.13


Table 5. n = 100, p = 6, 10% bad leverage points.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     1.84    1.06     1.83     0.24
        GOF        0.10    0.54     0.13     0.78
        RMSE       2.51    1.81     2.49     1.26
k = 2   MSE(β)     2.04    0.03     2.05     0.05
        GOF        0.18    0.83     0.18     0.83
        RMSE       2.42    1.11     2.42     1.11
k = 3   MSE(β)     2.56    0.17     4.52     0.65
        GOF        0.19    0.83     0.21     0.83
        RMSE       2.43    1.11     2.47     1.13


Table 6. n = 100, p = 6, 10% vertical outliers.

                             Method
                   CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   MSE(β)     3.31    1.10     2.55     0.24
        GOF        0.52    0.53     0.74     0.78
        RMSE       2.39    1.81     2.04     1.25
k = 2   MSE(β)     2.35    0.03     2.41     0.05
        GOF        0.78    0.83     0.78     0.83
        RMSE       1.97    1.10     1.96     1.11
k = 3   MSE(β)     4.46    0.13    10.79     0.62
        GOF        0.75    0.83     0.68     0.84
        RMSE       2.02    1.11     2.16     1.13


3.2. Simulation Results

When no contamination is added (Tables 1 and 4), the classical methods perform
somewhat better than their robust versions, as we would expect. At the optimal
$k_{opt} = 2$, CPCR and SIMPLS can hardly be distinguished. For high-dimensional
data, the MSE of CPCR is minimal, but the GOF and RMSE values are slightly
in favor of SIMPLS. For low-dimensional data, we see almost no differences. When
only one component is selected ($k = 1$), SIMPLS outperforms CPCR noticeably.
This confirms the findings of Frank and Friedman (1993) that PLSR constructs
its components more efficiently than PCR. However, when we choose more components
than required, here $k = 3$, we notice that SIMPLS suffers much more
from overfitting than CPCR. This is reflected in the large MSE values of SIMPLS.
For high-dimensional data MSE_3(SIMPLS) = 3.19 is even considerably larger than
MSE_3(RSIMPLS) = 0.91.

If we add contamination, CPCR and SIMPLS clearly break down. The MSE
of the regression parameter estimates increases drastically and even attains its
minimum at $k = 1$ (except for CPCR in Table 3). The GOF values are very low,
especially when the data contain bad leverage points. This shows that the regular
data points are badly fitted. The high RMSE values indicate the low predictive
ability of the classical methods. SIMPLS is more sensitive to vertical outliers than
CPCR. This is probably due to the fact that the response variable is already used
in the construction of the SIMPLS scores, contrary to the CPCR scores that only
depend on the $x$-variables.

The results of the robust methods on the other hand are almost identical
to the uncontaminated case. RSIMPLS is again superior to RPCR for $k = 1$ and
$k = 2$, but shows more overfitting when $k = 3$. Note that the GOF values of
RSIMPLS are always higher than those of RPCR. Both GOF and RMSE appear
to be good criteria to select $k$. We see that the differences GOF_3 - GOF_2 are very
small compared to GOF_2 - GOF_1. But we cannot conclude that we should choose
the $k$ for which GOF_k is maximal. On the other hand, the minimal value of RMSE is
always reached at the correct $k = 2$. This suggests selecting $k$ such that RMSE_k
is minimal.


Figure 1. The spectra of the octane data.



4. An Example 

In this section, we compute the GOF and the RMSE values of the two classical and 
the two robust calibration methods on a real data set. The octane data (Esbensen 
et al., 1994) consist of NIR absorbance spectra over p = 226 wavelengths ranging 
from 1102 to 1552 nm with measurements every two nm. For each of the n = 39 
production gasoline samples the octane number y was measured, so q = 1. 

This data set has been studied before in Hubert et al. (2004). It is well 
known that the octane data set contains six outliers to which alcohol was added. 
In Figure 1 we see that the spectra of those six samples clearly stand out. They 
were also detected by applying the robust PCA method ROBPCA to the 226
regressors (Hubert et al., 2004). Therefore we put the set of regular observations 
Gr equal to the full data of 39 observations minus these six outliers. 

For each of the four methods we computed the goodness-of-fit index as defined
in (3.1). Applying (3.2) is not possible here because no test data set is available,
and given the small number of observations, we do not want to split the
data set into a training and a test data set. Therefore, we use the cross-validated
R-RMSECV value as proposed in Hubert and Verboven (2003) and Hubert and
Vanden Branden (2003). It is defined as

$$\text{R-RMSECV}_k = \sqrt{\frac{1}{|G_r|} \sum_{i \in G_r} (y_i - \hat{y}_{-i,k})^2}. \tag{4.1}$$

Here $\hat{y}_{-i,k}$ represents the predicted $y$-value for observation $i$ based on $k$ components
when observation $i$ was left out of the estimation of the regression parameters. Note
that when the outliers are not known in advance, the set of regular observations
$G_r$ is determined by the calibration method itself. Hence different sets could be
obtained for e.g. RPCR and RSIMPLS. This would make it difficult to compare
R-RMSECV values of several methods.
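
A leave-one-out computation of this criterion could look as follows. The sketch
assumes that the squared errors are averaged over the regular observations only and
that the model is refitted on all remaining observations each time; `fit` and
`predict` are hypothetical callables standing in for any of the four calibration
methods.

```python
import numpy as np

def r_rmsecv(X, y, regular, fit, predict):
    """Leave-one-out RMSE evaluated on the regular observations only."""
    all_idx = np.arange(len(y))
    squared_errors = []
    for i in np.flatnonzero(regular):
        keep = all_idx != i                       # drop observation i
        model = fit(X[keep], y[keep])             # refit without observation i
        y_hat = predict(model, X[i:i + 1])[0]     # predict the left-out response
        squared_errors.append((y[i] - y_hat) ** 2)
    return np.sqrt(np.mean(squared_errors))
```

The refitting inside the loop is what makes the criterion computationally heavy for
robust methods.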

We evaluate criteria (3.1) and (4.1) for $k = 1, \ldots, 4$ components. The results
are summarized in Table 7.

Table 7. Analysis of the octane data.

                               Method
                      CPCR    RPCR    SIMPLS   RSIMPLS
k = 1   GOF          -0.04    0.81     0.13     0.82
        R-RMSECV      2.08    0.93     1.92     0.90
k = 2   GOF           0.80    0.98     0.85     0.98
        R-RMSECV      0.94    0.31     0.81     0.32
k = 3   GOF           0.98    0.98     0.98     0.99
        R-RMSECV      0.31    0.30     0.30     0.30
k = 4   GOF           0.98    0.98     0.98     0.99
        R-RMSECV      0.31    0.30     0.29     0.29

For $k = 1$ the classical methods perform very badly. For CPCR this even results
in a negative GOF_1(CPCR) = -0.04. Also GOF_1(SIMPLS) = 0.13 is very small
compared to the robust values. The R-RMSECV values of CPCR and SIMPLS are
approximately twice as high as those of RPCR and RSIMPLS.

When we retain $k = 2$ components, the GOF values of the classical methods
increase considerably, and their R-RMSECV values are more than halved,
yielding results that are comparable with the robust ones for $k = 1$. RPCR and
RSIMPLS clearly improve their fit with a very high GOF_2 = 0.98 and R-RMSECV
values of 0.31 and 0.32, respectively.

For $k = 3$ the classical values correspond with the robust ones for $k \geq 2$.
This clearly shows that the first component of CPCR and SIMPLS is completely
determined by the outliers and it confirms the conclusion in Tenenhaus (1998) to
retain the second and the third component of SIMPLS. The results obtained with
RPCR and RSIMPLS remain stable from $k = 2$ on, so these two methods suggest
retaining $k_{opt} = 2$ components. Note that here the R-RMSECV value is not at its
minimal value, but it hardly changes when $k$ is increased.

5. Conclusions 

Both the simulation study and the analysis of a real data set show that CPCR and 
SIMPLS are very sensitive to outliers in the data, whereas RPCR and RSIMPLS 
can resist several types of contamination. 

When the correct number of components is used in the calibration, RPCR
and RSIMPLS are comparable in terms of efficiency, goodness-of-fit, predictive
power and robustness. For smaller $k$, RSIMPLS is to be preferred, whereas RPCR
is less sensitive to overfitting when a larger set of components is selected.

Finally, the proposed GOF and RMSE/R-RMSECV measures are shown to
be good indicators to select the optimal number of components. To speed up the
heavy computations involved in the cross-validated R-RMSECV, we are currently
developing faster algorithms. This will allow fast and robust model
selection in multivariate calibration.

All the methods described in this paper can be downloaded from the web
sites http://www.wis.kuleuven.ac.be/stat/robust.html and
http://www.agoras.ua.ac.be/ as part of the Matlab library for Robust Calibration
(Verboven and Hubert, 2004).

Note that in this paper we have concentrated on calibration methods that 
are particularly useful for small data sets in high dimensions, which are very common
in chemometrics, food science and bioinformatics. Successful applications of
robust PCA in bioinformatics are e.g. presented in Model et al. (2002) and Hubert
and Engelen (2004). One of the referees wondered whether they could also be applied
to problems in computer vision where both the number of samples and the
number of variables can be very large. Currently ROBPCA is being implemented 
by Darren Cosker (3D Vision and Geometry, Department of Computer Science, 
Cardiff University, UK) to build statistical models of shape and appearance, i.e. 
Active Shape Models (ASM) and Active Appearance Models (AAM) (Cootes et 
al., 1998). In particular the method is used for building a model of mouth tracking. 
Other robust methods in computer vision have e.g. been proposed in Skocaj et al. 
(2002), Xu and Yuille (1995), De la Torre and Black (2001), and Stewart (1999). 
For an overview of the use of high-breakdown methods in computer vision, see
e.g. Meer (2004).



Analytic Estimator Densities for Common
Parameters under Misspecified Models

P.J. Hingley 



Abstract. An expression is given for the exact probability density function 
of the parameter values that maximize the likelihood of a statistical model, 
where the data generating model is allowed to differ from the estimation 
model. The density can be used to study the robustness of estimation of al- 
ternative hypothetical models. It is described for curved exponential families, 
then specifically for gamma distribution models and for nonlinear regression 
models. An example is given in the context of alternative models for data 
from the biochemical ELISA test method. Finally an indication is given of 
how a robustness index can be calculated to assess the effects of estimation 
of a common parameter vector under a wrong model. 

Mathematics Subject Classification (2000). Primary 62B10, 62-02; Secondary 
62B15, 62E15, 62F10, 62G35, 62M10, 62P10. 

Keywords. ELISA, exact estimator density, model robustness, nonlinear re- 
gression, small sample properties. 



1. Introduction 

How can the joint density of the maximum likelihood estimates (MLE) from an 
estimation model be specified, when the data generating model is different? In 
this case the MLE are termed quasi maximum likelihood estimates (QMLE). The
QMLE still maximize the likelihood with respect to the estimation model, but also
minimize the Kullback-Leibler information criterion between the estimation model
and the data generating model (White, 1994). Here an exact analytic expression
will be developed for the probability density function of the QMLE, that can be
applied to models that are nonlinear with respect to their parameters. It can be
used for any sample size, because its derivation does not depend on asymptotic
properties.

For estimation models with normal error structures, that are linear with respect
to their parameters, QMLE have an exact multivariate normal distribution.
But for nonlinear models, even where estimation and data-generating models are
identical, the vector of QMLE only has an asymptotic multivariate normal distribution.
Experimental designs with finite numbers of sampling points often lead to
markedly non-normal distributions (Bates and Watts, 1988). The proposed method
is an alternative to carrying out simulation studies.

The method that will be developed here may be of particular use to scien- 
tists who need to take account of the possibility that several possible underlying 
mathematical models could be generating the data in their experiments. 

2. Technique for Estimator Densities (TED) 

Consider the $n$ members of a sample as an $(n \times 1)$ vector $w$, $g_0(w)$ as the true
density of $w$, and $g_1(w|\theta)$ as a presumed density with $(p \times 1)$ parameter vector $\theta$
to be estimated. The log likelihood corresponding to $g_1(w|\theta)$ is $l(\theta|w)$. The space
of $w$ is $W$, and the space of $\theta$ is $\Theta$. A vector of independent variables $z$ can, if
necessary, be introduced to cope with the regression situation.

Regularity conditions are (with respect to $\theta$ in each case, and throughout
$W$ and $\Theta$): $g_1(w|\theta)$ is continuous; $l(\theta|w)$ is twice differentiable, possesses a single
maximum and has no other turning points. The QMLE $\hat{\theta}$ is then given by the same
expression as that for the MLE in the usual situation, $l'(\theta|w)|_{\theta=\hat{\theta}} = 0$ (where $'$
indicates differentiation with respect to $\theta$). The space of $\hat{\theta}$ is $\hat{\Theta}$, a subspace of $\Theta$.
Consider a $(p \times 1)$ vector $T$.

$$T(\hat{\theta}, \theta^*, w) = l'(\theta^*, w) - l'(\hat{\theta}, w). \tag{2.1}$$

The parameter $\theta^*$ is fixed at an arbitrary value (it is not the asymptotic limit of
$\hat{\theta}$, e.g., as represented by White, 1994, as $\theta^*$).

The following Theorem 2.1 and Corollary 2.2 show that $W$ can be divided
into distinct subspaces $W_{\hat{\theta}}$, corresponding to the different values of $\hat{\theta}$, so that $w$
determines $T(\hat{\theta}, \theta^*, w)$ within each subspace. For continuously distributed data, $W$
is made up of an uncountably large number of such subspaces $W_{\hat{\theta}}$.

Theorem 2.1. For fixed $\theta^*$, each possible value of $\hat{\theta}$ within $\hat{\Theta}$ corresponds to a
distinct subspace $W_{\hat{\theta}}$.

Proof. It will be shown that the negation of the theorem implies a contradiction.
For fixed $\theta^*$, consider two unequal values of $\hat{\theta}$, denoted $\hat{\theta}_1$ and $\hat{\theta}_2$, with overlapping
subspaces $W_{\hat{\theta}_1}$ and $W_{\hat{\theta}_2}$. Pick any value of $w$ from the part common to both
subspaces, and call it $w_c$. Then, from (2.1), $T(\hat{\theta}_1, \theta^*, w_c) = l'(\theta^*, w_c) =
T(\hat{\theta}_2, \theta^*, w_c)$. According to the regularity conditions, $l(\theta, w_c)$ possesses
a single maximum over $\Theta$ at $\hat{\theta}$, at which point $l'(\theta, w_c)|_{\theta=\hat{\theta}} = 0$ and $T = l'(\theta^*, w_c)$.
Therefore $\hat{\theta}_1$ and $\hat{\theta}_2$ are both equal to the unique $\hat{\theta}$. This contradicts inequality for
$\hat{\theta}_1$ and $\hat{\theta}_2$, and so $W_{\hat{\theta}_1}$ and $W_{\hat{\theta}_2}$ are distinct. □

Corollary 2.2. If $\theta^*$ and $w$ are fixed, $T(\hat{\theta}, \theta^*, w)$ is also fixed.

The following theorem establishes the main result (2.2) which gives the exact
density for $\hat{\theta}$.

Theorem 2.3. Assume that $W_{\hat{\theta}}$ specifies one or several distinct manifolds, that run
over a single open $(n - p)$ dimensional subset of $W$. Then the density of $\hat{\theta}$ is given
by

$$g(\hat{\theta}) = E_w[\,|j(\hat{\theta}, w)|\,\big|\,\theta = \hat{\theta}\,] \cdot g_{[T(\hat{\theta}, \theta^* = \hat{\theta}, w)]}(0), \tag{2.2}$$

where $j(\theta, w) = -l''(\theta, w)$ is the observed information.

The left-hand term in (2.2) is a conditional expectation, that is conditional on
$\theta = \hat{\theta}$ and is taken with respect to $w$ over $g_1(w|\theta)$. The second term represents the
density $g_{[T(\hat{\theta}, \theta^*, w)]}(t)$, in which the known vector $\theta^* = \hat{\theta}$, and hence $t = 0$ by (2.1).



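Proof. Part 1. Special case of a unique $w_{\hat{\theta}}$ for each $\hat{\theta}$:

Imagine initially that, for each $\hat{\theta}$ in $\hat{\Theta}$, $W_{\hat{\theta}}$ contains a single distinct vector, $w_{\hat{\theta}}$
say. For each possible $\hat{\theta}$, the correspondence to $w_{\hat{\theta}}$ is determined by the estimation
model $g_1$. The conditional density of $\hat{\theta}$, for fixed $\theta^*$, is given by a standard change
of variable argument, using a Jacobian, as follows.

$$g(\hat{\theta}|\theta^*) = g_{[T(\hat{\theta}, \theta^*, w_{\hat{\theta}})]}(t) \cdot |\partial T / \partial \hat{\theta}|. \tag{2.3}$$

The density $g_{[T(\hat{\theta}, \theta^*, w_{\hat{\theta}})]}(t)$ is the density of a transform of $w_{\hat{\theta}}$ alone under
the data generating model $g_0(w)$, even if some other value of $w$, say $w^+$ with
corresponding QMLE $\hat{\theta}^+$, gives $T(\hat{\theta}^+, \theta^*, w^+) = T(\hat{\theta}, \theta^*, w_{\hat{\theta}})$. Theorem 2.1 shows
that it is not possible for $\hat{\theta}^+ = \hat{\theta}$.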



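Part 2. Usual case of one or more $w$ for each $\hat{\theta}$:

For each $\hat{\theta}$ in $\hat{\Theta}$, if $W_{\hat{\theta}}$ contains one or more vectors $w$, then the assignment of $w$
to $W_{\hat{\theta}}$ depends upon the estimation model $g_1$. For fixed $\theta^*$, each $w$ value within
$W_{\hat{\theta}}$ corresponds to $\hat{\theta}$ (by Theorem 2.1), and specifies $T(\hat{\theta}, \theta^*, w)$ (by Corollary 2.2),
although $T(\hat{\theta}, \theta^*, w)$ can vary between $w$ values within $W_{\hat{\theta}}$.

The density $g(\hat{\theta}|\theta^*)$ can be found as the conditional expectation under
$g_1(w|\theta = \hat{\theta})$ of (2.3) over $W_{\hat{\theta}}$. Taking an $(n - p)$ coordinate basis on the manifold $W_{\hat{\theta}}$,
in coordinates $v_1, v_2, \ldots, v_{n-p} = v$, this conditional expectation can be written as
a ratio of integrals with respect to variations over $v$.

$$g(\hat{\theta}|\theta^*) = \frac{\int_{W_{\hat{\theta}}(v)} g_{[T(\hat{\theta}, \theta^*, w(v))]}(t) \cdot |\partial T / \partial \hat{\theta}|_{\theta=\hat{\theta}} \cdot g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv}{\int_{W_{\hat{\theta}}(v)} g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv}. \tag{2.4}$$

The density $g_{[T(\hat{\theta}, \theta^*, w(v))]}(t)$ is the density of a transform of $w(v)$ alone under
$g_0(w)$. The term $\partial T / \partial \hat{\theta}$ can be evaluated from (2.1) as $-l''(\hat{\theta}, w(v))$ (the second
derivative of $l$ with respect to $\theta$), and $|l''(\hat{\theta}, w(v))| = |j(\hat{\theta}, w(v))|$. If the subspace $W_{\hat{\theta}}$ runs over
more than one manifold, then the top and bottom of expression (2.4) are split into
the sums of the corresponding integrals for each manifold before taking the ratio.
The same $(n - p)$ coordinate basis is taken over all such manifolds.

Part 3. The unconditional density $g(\hat{\theta})$:

In equation (2.4) it has been assumed that $\theta^*$ is fixed and the density $g_{[T(\hat{\theta}, \theta^*, w)]}(t)$
has been used to develop the conditional density $g(\hat{\theta}|\theta^*)$. A family of densities
Fam$[g(\hat{\theta}|\theta^*)]$ can be considered to exist for different values of $\theta^*$.

Consider a mechanism for computing $g(\hat{\theta})$ for a given value of $\hat{\theta}$. For each
value of $\hat{\theta}$ for which $g(\hat{\theta})$ is to be evaluated, select $g(\hat{\theta}|\theta^* = \hat{\theta})$ as that member
of Fam$[g(\hat{\theta}|\theta^*)]$ for which $\theta^* = \hat{\theta}$. From (2.1), $\theta = \hat{\theta}$ and $\theta^* = \hat{\theta}$ together imply
$T = 0$. The term $g_{[T(\hat{\theta}, \theta^*, w(v))]}(t)$ in (2.4) can now be written $g_{[T(\hat{\theta}, \theta^* = \hat{\theta}, w(v))]}(0)$.
Repeating this selection process for all the required values of $\hat{\theta}$, the corresponding
terms $g(\hat{\theta}|\theta^* = \hat{\theta})$ taken together constitute a new density $g(\hat{\theta})$ that is independent
of $\theta^*$.

$$g(\hat{\theta}) = \frac{\int_{W_{\hat{\theta}}(v)} g_{[T(\hat{\theta}, \theta^* = \hat{\theta}, w(v))]}(0) \cdot |j(\theta, w(v))|_{\theta=\hat{\theta}} \cdot g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv}{\int_{W_{\hat{\theta}}(v)} g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv}. \tag{2.5}$$

Now $T(\hat{\theta}, \theta^* = \hat{\theta}, w(v)) = 0$ throughout $W_{\hat{\theta}}$. It follows that the factor
$g_{[T(\hat{\theta}, \theta^* = \hat{\theta}, w(v))]}(0)$ is not a function of $v$ and can be moved outside the integrand
on the top line of (2.5).

$$g(\hat{\theta}) = \left[ \frac{\int_{W_{\hat{\theta}}(v)} |j(\theta, w(v))|_{\theta=\hat{\theta}} \cdot g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv}{\int_{W_{\hat{\theta}}(v)} g_1(w(v)|\theta = \hat{\theta}) \cdot \|w'(v)\| \, dv} \right] \cdot g_{[T(\hat{\theta}, \theta^* = \hat{\theta}, w)]}(0). \tag{2.6}$$

The bracketed term in (2.6) is the conditional expectation of $|j(\theta, w(v))|_{\theta=\hat{\theta}}$,
conditional on $\hat{\theta}$ under $g_1$, and can be written $E_w[\,|j(\hat{\theta}, w)|\,\big|\,\theta = \hat{\theta}\,]$. □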



Equation (2.2), detailed at (2.6), is the required expression for $g(\hat{\theta})$. The
exact density $g(\hat{\theta})$ is invariant under linear transformation of $w$. It also transforms
under one to one reparameterization of $\theta$, to $\phi$ say, as $g(\hat{\phi}) = |\partial \hat{\theta}(\hat{\phi}) / \partial \hat{\phi}| \, g(\hat{\theta})$.

It should be noted that $E_w[\,|j(\hat{\theta}, w)|\,\big|\,\theta = \hat{\theta}\,]$ is the conditional expectation of
the determinant of the observed information, conditional on $\hat{\theta}$. This is not generally
the same as $[(E_w[\,|j(\theta, w)|\,])\,|\,\theta = \hat{\theta}]$, which is the unconditional expected information
at $\hat{\theta}$. However, for some simple models there can be equivalence between these
expectation terms (e.g., the examples in Section 3.2 and Section 4 below).

It is usually unnecessary to evaluate the multidimensional integrals in (2.6),
because in practice terms in $w$ can be replaced by $E_w[w|\theta = \hat{\theta}]$, terms in $w^2$ by
$E_w[w^2|\theta = \hat{\theta}]$, etc. For example, if the model $g_1(w|\theta)$ was normal $N(h[\theta_1], \theta_2)$,
terms proportional to $w$ would be replaced by terms proportional to $h[\hat{\theta}_1]$, and
terms proportional to $w^2$ would be replaced by terms proportional to $\hat{\theta}_2 + (h[\hat{\theta}_1])^2$.
Also, when $W$ is discretised, it is generally the case that the distribution of $\hat{\theta}$ is
discretised and no conditional expectation term is required in (2.2).

In the special case that the data generating model is equivalent in form
to the estimation model, equation (2.2) reduces to a version of a formula given
by Skovgaard (1990) for the density of minimum contrast estimators. Hillier and
Armstrong (1999) give an integral formula for the density of the MLE in this
special case.



3. Examples 

In this section it will be shown how TED can be applied when the estimation
model comes from a curved exponential family. In this case $T(\hat\theta, \theta^* = \hat\theta, w)$ turns
out to be a linear transform of a functional of $w$, and so its density can be found
relatively easily.



3.1. Curved Exponential Families as Estimation Models 

Let the $(p \times 1)$ parameter vector $\theta$ appear in a $(n \times 1)$ canonical function $b(\theta, z)$ and
in a $(n \times 1)$ functional $c(\theta, z)$, together with $(n \times 1)$ functionals of the data $a(w)$ and
$d(w)$; all constrained to describe a valid density for $w$ (Dobson, 1983).

\[
g_1(w \mid \theta) = \exp[\, a(w)^T b(\theta, z) + 1_{(n \times 1)}^T c(\theta, z) + 1_{(n \times 1)}^T d(w) \,], \tag{3.1}
\]

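where $^T$ indicates transposition. If $f(\theta, z)$ is the unconditional expectation of $a(w)$,
McCullagh and Nelder (1989) show the following.

\[
b'(\theta, z)_{(p \times n)} \cdot f(\theta, z)_{(n \times 1)} = -c'(\theta, z)_{(p \times n)} \cdot 1_{(n \times 1)}. \tag{3.2}
\]

Therefore

\[
l'(\theta, w) = b'(\theta, z) \cdot (a(w) - f(\theta, z)), \tag{3.3}
\]

\[
T(\hat\theta, \theta^*, w) = b'(\theta^*, z) \cdot (a(w) - f(\theta^*, z)). \tag{3.4}
\]

Equation (3.4) shows that $T(\hat\theta, \theta^*, w)$ is a linear transform of $a(w)$.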

The conditional expected information term $E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,]$ can be obtained
after differentiating equation (3.3) again to give $-j(\theta, w)$.

\[
-j(\theta, w)_{(p \times p)} = l''(\theta, w)
= [\, b''(\theta, z)_{(p \times p \times n)} \cdot (a(w) - f(\theta, z)) \,] - [\, f'(\theta, z)_{(p \times n)} \cdot (b'(\theta, z))^T \,]. \tag{3.5}
\]

The left-hand part of (3.5) is the curvature of the estimation model (Barndorff-Nielsen
and Cox, 1994, page 70), while the right-hand part is $(E_w[\,|j(\theta, w)|\,])$.



3.2. Estimation of the Rate Constant for a Gamma Distribution using 
the Negative Exponential Distribution 

Consider i.i.d. data from a data generating model given by the gamma distribution,
and an estimation model given by a negative exponential distribution.

\[
g_0(w_i) = \frac{\theta_0^{\alpha} \, w_i^{\alpha - 1} \exp[-\theta_0 w_i]}{\Gamma(\alpha)}, \quad w_i > 0, \; i = 1, \ldots, n,
\]
\[
g_1(w_i \mid \theta) = \theta \exp[-\theta w_i], \quad w_i > 0, \; i = 1, \ldots, n. \tag{3.6}
\]

The shape parameter $\alpha$ takes a fixed value here and is not considered to be estimated.
The estimation model can be restructured in the form (3.1), where in this
case the quantities $T$, $\hat\theta$ and $\theta^*$ are scalars. More precisely, $a(w) = w$,
$b(\theta, z) = -\theta \, 1_{(n \times 1)}$, $1_{(n \times 1)}^T c(\theta, z) = 1_{(n \times 1)}^T (\log\theta) \, 1_{(n \times 1)}$, $1_{(n \times 1)}^T d(w) = 0$ and

\[
l'(\theta, w) = 1_{(1 \times n)} \, (-w_{(n \times 1)} + 1_{(n \times 1)} (1/\theta)),
\]
\[
T(\hat\theta, \theta^*, w) = -s + (n/\theta^*), \quad s = \sum_i w_i.
\]



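The distribution of $s$ is obtained from $g_0(w_i)$, using the reproductive property
that the sum of a set of i.i.d. gamma variables itself has a gamma distribution.

\[
g(s) = \frac{\theta_0^{n\alpha} \, s^{n\alpha - 1} \exp[-\theta_0 s]}{\Gamma(n\alpha)}, \quad s > 0.
\]

From this, the distribution of $T(\hat\theta, \theta^*, w)$ is given as

\[
g(t) = \frac{\theta_0^{n\alpha} \, [(n/\theta^*) - t]^{n\alpha - 1}}{\Gamma(n\alpha)} \exp(-\theta_0 [(n/\theta^*) - t]), \quad t < \frac{n}{\theta^*}.
\]

In this case $E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,] = [(E_w[\,|j(\theta, w)|\,]) \mid \theta = \hat\theta] = n/\hat\theta^2$. From (2.2)

\[
g(\hat\theta) = \frac{(n\theta_0)^{n\alpha}}{\hat\theta^{(n\alpha) + 1} \, \Gamma(n\alpha)} \exp[-n\theta_0/\hat\theta], \quad \hat\theta > 0. \tag{3.7}
\]

This density can be verified using other analytic methods.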
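A simulation check is also possible. The sketch below is an illustration and not part of the original analysis: it assumes the values n = 5, alpha = 2 and theta_0 = 1 (chosen so that n*alpha is an integer), and the function names are hypothetical. It uses the fact that the QMLE under the negative exponential model is n/s, so the distribution function implied by (3.7) is P(s >= n/x) for a gamma variable s with shape n*alpha and rate theta_0, which is a finite sum when n*alpha is an integer.

```python
import math
import random


def cdf_theta_hat(x, n, alpha, theta0):
    """P(theta_hat <= x) implied by density (3.7), for integer n*alpha.

    theta_hat = n / s with s ~ Gamma(shape=n*alpha, rate=theta0), so
    P(theta_hat <= x) = P(s >= n/x), which is a finite Poisson-type sum.
    """
    k = n * alpha                      # assumed to be a positive integer
    m = theta0 * n / x
    return math.exp(-m) * sum(m ** i / math.factorial(i) for i in range(k))


def simulate_theta_hat(n, alpha, theta0, reps, seed=0):
    """Draw theta_hat = n / sum(w_i), with w_i ~ Gamma(alpha, rate=theta0)."""
    rng = random.Random(seed)
    scale = 1.0 / theta0               # gammavariate expects a scale, not a rate
    return [n / sum(rng.gammavariate(alpha, scale) for _ in range(n))
            for _ in range(reps)]


n, alpha, theta0 = 5, 2, 1.0           # illustrative values (an assumption)
draws = simulate_theta_hat(n, alpha, theta0, reps=20000)
for x in (0.3, 0.5, 1.0):
    empirical = sum(d <= x for d in draws) / len(draws)
    print(x, round(empirical, 3), round(cdf_theta_hat(x, n, alpha, theta0), 3))
```

The empirical and analytic distribution functions should agree up to Monte Carlo error.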



3.3. Nonlinear Regression Models with Known Homoscedastic Normal Errors 

An expression will be developed for the density of a parameter vector estimated
from a nonlinear regression model, where data generating model and estimation
model are distinct, but both are of a multivariate normal form $MN(\mu(z), \Sigma)$.
Here $\mu(z)$ is the $(n \times 1)$ vector of mean responses and $\Sigma$ is the known $(n \times n)$
covariance matrix. In this application of a curved exponential family model, it is
convenient to assume that enough previous data have been analyzed to give a good
estimate of $\Sigma$, or that the whole method is to be applied conditional on $\Sigma$.

Assume that the data generating model $g_0(w \mid z)$ has $\mu(z) = f_0(z)$ and the
estimation model $g_1(w \mid \theta, z)$ has $\mu(z) = f(\theta, z)$, where $f(\theta, z)$ is nonlinear in $\theta$. The
estimation model can be restructured in the form (3.1) with $a(w) = w$,
$b(\theta, z) = \Sigma^{-1} f(\theta, z)$, $1^T c(\theta, z) = -\tfrac{1}{2} f(\theta, z)^T \Sigma^{-1} f(\theta, z)$,
$1^T d(w) = -\tfrac{1}{2} [\, n \log(2\pi) + \log|\Sigma| + w^T \Sigma^{-1} w \,]$ and

\[
T(\hat\theta, \theta^*, w) = f'(\theta^*, z) \, \Sigma^{-1} [\, w - f(\theta^*, z) \,].
\]

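Since $T(\hat\theta, \theta^*, w)$ is a linear transform of $w$, its density is as follows (Gallant, 1987,
page 122):

\[
MN_T\left( f'(\theta^*, z) \, \Sigma^{-1} [\, f_0(z) - f(\theta^*, z) \,] ; \; f'(\theta^*, z) \, \Sigma^{-1} \Sigma \Sigma^{-1} (f'(\theta^*, z))^T \right). \tag{3.8}
\]

This applies to a general covariance matrix $\Sigma$, but attention will now be restricted
to the simple i.i.d. case $\Sigma = \sigma^2 I$, where $I$ is the $(n \times n)$ identity matrix. From
(2.2), (2.6) and (3.5), the top line of $E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,]$ is

\[
\int_v \left| \sigma^{-2} \left( f''(\hat\theta, z) [\, w(v) - f(\hat\theta, z) \,] - [\, f'(\hat\theta, z) (f'(\hat\theta, z))^T \,] \right) \right|
\cdot g_1(w(v) \mid \theta = \hat\theta) \cdot \|w'(v)\| \, dv.
\]

In order to obtain the density $g(\hat\theta)$ from (2.2), evaluate the density (3.8) at
$\theta^* = \hat\theta$ and $T = 0$.

\[
g(\hat\theta) = E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,] \cdot \left| 2\pi \sigma^{-2} f'(\hat\theta, z) (f'(\hat\theta, z))^T \right|^{-1/2}
\cdot \exp\left( -\frac{1}{2\sigma^2} [\, f(\hat\theta, z) - f_0(z) \,]^T (f'(\hat\theta, z))^T [\, f'(\hat\theta, z) (f'(\hat\theta, z))^T \,]^{-1} f'(\hat\theta, z) [\, f(\hat\theta, z) - f_0(z) \,] \right). \tag{3.9}
\]

This reduces to a simpler form for a linear model $f(\theta, z) = X_{(n \times p)} \theta_{(p \times 1)}$, when
$g(\hat\theta)$ is multivariate normal $MN((X^T X)^{-1} X^T f_0(z) ; \; (X^T X)^{-1} \sigma^2)$.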



3.4. Nonlinear Regression Model for the Estimation of Antibody Affinities from 
the ELISA Immunoassay 

In this biochemical test, the extent of reaction of various concentrations of a stan- 
dard antigen mixed with a sample of blood can be measured quantitatively. The 
data on the extent of reaction follow a sigmoid saturation profile when plotted 
against log concentration. It has been suggested that the data could be modelled 
using a single logistic function with a homoscedastic normal error distribution 
(Hingley and Ouldridge, 1985; Hingley, 1991): 



\[
f_0(z) = \frac{1}{1 + 10^{\alpha_0 (\gamma_0 + z)}}.
\]

Here $\gamma_0$ is the mean antibody affinity; $\alpha_0$ is a parameter related to the variance of
the underlying distribution of antibody affinity, $0 < \alpha_0 < 1$; and $z$ is log concentration.

However the underlying distribution of antibodies could be assumed to be
of an entirely different form, such as the following combination of two logistic
functions:

\[
f(\theta, z) = 0.5 \left[ \frac{1}{1 + 10^{\gamma + z - \delta}} + \frac{1}{1 + 10^{\gamma + z + \delta}} \right].
\]

Here the parameter vector $\theta = [\gamma, \delta]^T$ comprises $\gamma$, which represents the mean
affinity; and $\delta$, which is related to the variance of an underlying two point affinity
distribution, $\delta > 0$. A set of data were previously reported for serum from an
individual animal (Hingley, 1988, Table 1). After re-expressing these data as proportions
of their plateau values, the estimated parameters from a fit of the single
logistic function are $\gamma_0 = -2.75$, $\alpha_0 = 0.92$ and $\sigma^2 = 0.008787$.

Assuming the correctness of this model, a total of 5000 simulated sets of
data were generated around a single logistic data generating model with these
parameter values, by using normally distributed random numbers. The two-point
affinity estimation model was then fitted to each set of data (Hingley, 1992). The
QMLE could only be obtained from 3571 sets of the data (71 per cent of the
5000 data sets). In the other 1429 cases the log-likelihood remained increasing
at the parameter boundary $\delta = 0$, and no local maximum was available in the
differentiable region. However for these cases the values of $\gamma$ that minimized the
residual sum of squares along the boundary were distributed in similar proportions
to the marginal distribution of $\gamma$ over the remaining 3571 sets of data.

For the application of TED to this system, $g(\hat\theta)$ was determined from equation
(3.9) using a routine written in APL2. In order to obtain $E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,]$, the
matrices $f'(\hat\theta, z)$ and $f''(\hat\theta, z)$ were decomposed in order to reflect the contribution
of the parameters $\gamma$ and $\delta$, indexed 1 and 2 respectively.

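\[
f'(\hat\theta, z)_{(2 \times n)} = \begin{pmatrix} f'_{1\,(1 \times n)} \\ f'_{2\,(1 \times n)} \end{pmatrix}, \qquad
f''(\hat\theta, z)_{(2 \times 2 \times n)} = \begin{pmatrix} f''_{11\,(1 \times n)} & f''_{12\,(1 \times n)} \\ f''_{21\,(1 \times n)} & f''_{22\,(1 \times n)} \end{pmatrix}.
\]

Here $f''_{12} = f''_{21}$, since the order of differentiation is immaterial.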



Summations in this expression are over the sample members i from 1 to n. 




Figure 1. Distributions of estimates: bivariate density for ELISA
data where the data generating model is logistic and the estimation
model is a two point affinity distribution ($\hat\theta = [\gamma, \delta]^T$). The
exact density $g(\hat\theta)$, solid lines; simulation results, dashed lines.

Figure 1 shows a comparison of the TED expression $g(\hat\theta)$ to a surface fitted
to a histogram of the 3571 results. This surface has been normalized so that it
would be a valid two dimensional density over the whole set of 5000 simulations.
The absence of 1429 results means that the volume subtended by the surface is
approximately 0.71 (= 3571 / 5000), rather than 1. The distribution is fairly symmetric,
with a maximum at about $\gamma = -2.75$, $\delta = 0.32$. The analytic and simulation
methods give similar results, although there is some roughness in the surfaces
due to the simulations. It seems that the analytic method may be modelling only
the set of maxima in the differentiable region of the likelihood, accounting again
for about 71 per cent of the data sets.

The technique described above for determining $E_w[\,|j(\hat\theta, w)| \mid \theta = \hat\theta\,]$ could be
extended to models involving more parameters, $p > 2$, by using analytic expressions
for the evaluation of determinants involving cofactors (e.g., Mardia et al.,
1979, page 456).



4. Robustness 



TED can be applied even if the parameter vector $\theta$ has meaning only on the
estimation model. But it can also be used as a tool for robustness studies when
there is a common parameter vector specified on both models. If $\theta$ is such a common
parameter vector, $g_0(w)$ can be written as $g_0(w \mid \theta_0)$.

It is possible to calculate two densities for $\hat\theta$. For both densities assume that
the data generating model is $g_0(w \mid \theta_0)$. The first density assumes that the estimation
model is $g_0(w \mid \theta)$, and is written as $g(\hat\theta) = h_0(\hat\theta \mid \theta_0)$. The second density
assumes that the estimation model is $g_1(w \mid \theta)$, and is written as $g(\hat\theta) = h_1(\hat\theta \mid \theta_0)$.
A size $\alpha$ robustness index (RI) can be defined as follows.

1. Calculate a $(1 - \alpha)$ central interval (CI) for $\hat\theta$, using $h_0(\hat\theta \mid \theta_0)$.

2. Calculate RI as the probability that $\hat\theta$ lies within CI, using $h_1(\hat\theta \mid \theta_0)$.

It can be shown that the RI remains constant for any new metric obtained
by one to one reparameterisation of $\theta$.

The following example continues the comparison from Section 3.2 of a gamma
data generating model with a negative exponential estimation model. Figure 2
shows the gamma model $g_0(w \mid \theta_0)$ for the case $\alpha = 2$, $\theta_0 = 1$, $n = 5$. The negative
exponential estimation model $g_1(w \mid \theta_0)$ is also shown.

The density $h_1(\hat\theta \mid \theta_0)$ has already been given in equation (3.7). The density
$h_0(\hat\theta \mid \theta_0)$ can be developed in an analogous fashion.

\[
g(\hat\theta) = \frac{(n\alpha\theta_0)^{n\alpha}}{\hat\theta^{(n\alpha) + 1} \, \Gamma(n\alpha)} \exp[-n\alpha\theta_0/\hat\theta], \quad \hat\theta > 0.
\]



Figure 3 shows $h_0(\hat\theta \mid \theta_0)$ and $h_1(\hat\theta \mid \theta_0)$. The 95 per cent CI is (0.585, 2.085).
The size 0.05 RI is calculated as 0.352. That is, there is a 35 per cent probability
that the estimate will lie in the 95 per cent interval in which it would have occurred
if estimation had been carried out on the "correct" model.

Figure 2. Data distributions: gamma data generating model
$g_0(w \mid \theta_0)$, solid line; and negative exponential estimation model
$g_1(w \mid \theta_0)$, dashed line.

Figure 3. Distributions of estimates $g(\hat\theta)$: gamma estimation
model $h_0(\hat\theta \mid \theta_0)$, solid line; and negative exponential estimation
model $h_1(\hat\theta \mid \theta_0)$, dashed line.
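The two-step robustness index calculation of this section can be sketched numerically. The code below is an illustration under stated assumptions, not the routine used for the paper: it takes shape 2, rate theta_0 = 1 and n = 5 from the example, relies on n times the shape being an integer so that the gamma distribution function is a finite sum, and uses hypothetical helper names. Under the gamma estimation model the estimate is (n times shape)/s, and under the negative exponential model it is n/s, where s is the sum of the observations.

```python
import math


def gamma_cdf(x, k):
    """CDF of a Gamma(shape=k, rate=1) variable, for a positive integer k."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(k))


def gamma_quantile(prob, k, lo=0.0, hi=1000.0):
    """Invert gamma_cdf by bisection on [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, k) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def robustness_index(n, shape, theta0, size):
    """Return the (1 - size) central interval under h0 and the RI under h1."""
    k = n * shape                          # s ~ Gamma(shape=k, rate=theta0)
    s_lo = gamma_quantile(size / 2, k) / theta0
    s_hi = gamma_quantile(1 - size / 2, k) / theta0
    # Step 1: under h0 the estimate is k / s, a decreasing function of s.
    ci = (k / s_hi, k / s_lo)
    # Step 2: under h1 the estimate is n / s, so
    # ci[0] < n/s < ci[1]  is the event  n/ci[1] < s < n/ci[0].
    ri = gamma_cdf(theta0 * n / ci[0], k) - gamma_cdf(theta0 * n / ci[1], k)
    return ci, ri


ci, ri = robustness_index(n=5, shape=2, theta0=1.0, size=0.05)
print(ci, ri)
```

The interval and index printed should be close to the values quoted in the text.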
5. Discussion

The exact analytic density $g(\hat\theta)$ for QMLE or MLE, that is given by equation (2.2),
may turn out to be particularly useful for examining model robustness properties
for nonlinear regression. Equation (3.9) could easily be extended to cope with
multiplicative error functions, like $\sigma = k f_0(z)$. Consideration of other kinds of
covariance structure, such as those found in simple time series models, should in
principle be straightforward. It would also be interesting to find an expression for
$g(\hat\theta)$ when the covariance matrix is an estimable parameter.

Methodological extensions could be considered. There should be little difficulty
in extending TED to the case of two or more independent variables, or to
situations where the error structure of the models differs instead of, or as well as,
the model functions. It may also be possible to generalize the approach to cover 
the distributions of other types of quasi M-estimators in addition to QMLE, by 
studying the distributions of terms analogous to T as defined in (2.1). 

The robustness calculation in Section 4 is trivial, but establishes an approach 
that could be applied to more interesting situations, such as nonlinear regression 
with a vector of estimable parameters. For a full study of robustness, it seems 
necessary to postulate a series of candidate data generating models against which 
a particular estimation model can be assessed. By comparison of the various den- 
sities produced, it can be decided whether or not the adoption of a particular 
estimation model actually matters. In some cases a common parameter might be 
better estimated on a wrong model than on the correct one. 
Empirical Comparison of the Classification
Performance of Robust Linear and Quadratic
Discriminant Analysis

Abstract. The aim of this paper is to look at the behavior of the total probability
of misclassification of robust linear and quadratic discriminant analysis.
The effect of outliers on the discriminant rules is studied by comparing their
total probabilities of misclassification in the presence of outliers.

Mathematics Subject Classification (2000). Primary 62G35; Secondary 62H30. 

Keywords. Discriminant analysis, total probability of misclassification, robust- 
ness. 



1. Introduction 

In discriminant analysis one observes two groups of multivariate observations,
which together form the training sample. Using these data a discriminant rule is
determined, which is used afterwards to classify new observations into one of the
two groups. In this paper we restrict ourselves to the case of two multivariate normally
distributed populations. We observe $p$-variate observations $x_{11}, \ldots, x_{1n_1}$ coming
from a first population $p_1 \sim H_1 = N_p(\mu_1, \Sigma_1)$ and $x_{21}, \ldots, x_{2n_2}$ coming from a
second population $p_2 \sim H_2 = N_p(\mu_2, \Sigma_2)$.

Supposing that the observations are generated from two multivariate normal
distributions, it is known that the optimal discriminant rule (i.e., the one
minimizing the misclassification probability) is a quadratic function given by

\[
Q(x) = -\tfrac{1}{2} x^T (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^T \Sigma_1^{-1} - \mu_2^T \Sigma_2^{-1}) x - k(\mu_1, \mu_2, \Sigma_1, \Sigma_2), \tag{1.1}
\]

where

\[
k(\mu_1, \mu_2, \Sigma_1, \Sigma_2) = \tfrac{1}{2} \log\left( \frac{|\Sigma_1|}{|\Sigma_2|} \right) + \tfrac{1}{2} (\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2).
\]

We assign a new $p$-variate observation $x$ to $p_1$ if

\[
Q(x) > \log\left( \frac{c_2 \pi_2}{c_1 \pi_1} \right) = \tau, \tag{1.2}
\]

where $c_1$ and $c_2$ are the costs of misclassifying a unit of, respectively, $p_1$ and $p_2$
and $\pi_1$ and $\pi_2$ are the prior probabilities that $x$ will belong to, respectively, $p_1$
and $p_2$. In practice these parameters are unknown and therefore we set $\tau = 0$
throughout this paper.

If we assume that the covariance matrices are equal ($\Sigma := \Sigma_1 = \Sigma_2$), we get
the familiar Fisher's linear discriminant rule

\[
L(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2). \tag{1.3}
\]

Since the primary goal of discriminant analysis is to classify data, we are particularly
interested in the total probability of misclassification of a particular
discriminant rule.

Robust linear discriminant analysis has been considered in several papers
(e.g., Hawkins and McLachlan, 1997; He and Fung, 2000; Croux and Dehon, 2001).
The first to consider robust quadratic discriminant analysis seem to be
Randles et al. (1978), who used M-estimators for the means $\mu_1$ and $\mu_2$ and the covariance
matrices $\Sigma_1$ and $\Sigma_2$ in (1.1), and a rank based rule to estimate the cut-off
value $\tau$ in (1.2). One of the Editors of this book also pointed out a forthcoming
paper of Hubert and Van Driessen (2004), using the MCD-estimators. Note that
this approach is extendable to the multiple group case. In this paper we compare
the performance of robust linear and quadratic discriminant analysis using both S-
and MCD-estimators. Based on simulation experiments, our main finding is that
it seems to be profitable to use the quadratic rather than the linear discriminant rule, also
in the presence of outliers. Also, and not surprisingly, the robust rules outperform the
classical ones when deviating from the model, while performing almost as well at
the model distribution.



2. Robustification of Classical Discriminant Analysis 

Outliers and atypical observations might have an influence on the results of clas- 
sical discriminant analysis, since the discriminant rules are based on estimates of 
the population parameters. The outliers and atypical observations might shift the 
estimated means and they might blow up the dispersion matrices. To prevent this 
we make use of robust estimators of the population parameters. For our study we 
will look at two robust estimators, the MCD-estimator and the S-estimator. 

The MCD-estimators were introduced by Rousseeuw (1985). The estimator 
is given by the subset of size h for which the determinant of its covariance matrix 
is minimal. The MCD-estimator of location is then given by the mean of these 
h observations and the MCD-estimator of covariance is given by their covariance 
matrix. 
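The definition above can be illustrated by an exhaustive search on a tiny bivariate sample. This is only a sketch of the definition: it is not the fastmcd algorithm, it omits the consistency factors and reweighting steps that practical implementations usually add, and the data and function names are made up for illustration. Exhaustive search is feasible only when the number of subsets of size h is small.

```python
from itertools import combinations


def mean_and_cov(points):
    """Sample mean and covariance (divisor m - 1) of bivariate points."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points) / (m - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (m - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (m - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def det2(c):
    """Determinant of a 2 x 2 matrix given as nested tuples."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]


def mcd_exhaustive(points, h):
    """Mean and covariance of the h-subset with the smallest covariance determinant."""
    best = min(combinations(points, h), key=lambda s: det2(mean_and_cov(s)[1]))
    return mean_and_cov(best)


# Made-up data: seven points near the origin and one gross outlier.
data = [(0.1, 0.2), (-0.3, 0.1), (0.4, -0.2), (0.0, 0.5), (-0.2, -0.4),
        (0.3, 0.3), (-0.1, -0.1), (9.0, 9.5)]
location, scatter = mcd_exhaustive(data, h=6)
classical_location, _ = mean_and_cov(data)
print(location, classical_location)
```

In this example the classical mean is pulled towards the outlier, while the subset-based location stays near the bulk of the data.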
The S-estimators of location and multivariate dispersion were jointly introduced
by Davies (1987) and Rousseeuw and Leroy (1987). To define these estimators,
let $X$ be a sample of $p$-variate observations and let $n_X$ be the number
of observations in the sample. The S-estimators of location and dispersion of this
sample are defined as

\[
(\hat\mu, \hat\Sigma) = \underset{\mu, \Sigma}{\operatorname{argmin}} \, \det \Sigma \quad \text{such that} \quad
\frac{1}{n_X} \sum_{x \in X} \rho\left( \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)} \right) = b,
\]

where $\mu \in \mathbb{R}^p$ and $\Sigma \in \mathbb{R}^{p \times p}$ is a symmetric positive definite matrix. The function
$\rho$ needs to satisfy the following condition

(R) $\rho : \mathbb{R} \to \mathbb{R}^+$ is a symmetric, continuous, non decreasing function on $[0, \infty)$.
Moreover, $\rho(0) = 0$ and $\rho$ has a continuous derivative in all but a finite number
of points.

The constant $b$ is set equal to $E_{F_0}[\rho(\|Z\|)]$ for $F_0$, the central model distribution,
being $N(0, I_p)$ in our case. The most commonly used choice for $\rho$ is the Biweight
function which is defined as



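\[
\rho(u) = \begin{cases}
\dfrac{u^2}{2} - \dfrac{u^4}{2c^2} + \dfrac{u^6}{6c^4} & \text{if } |u| \le c \\[2mm]
\dfrac{c^2}{6} & \text{if } |u| > c
\end{cases}
\]

where $c$ is a tuning constant to achieve the desired value of the breakdown point.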
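The Biweight function is simple to evaluate directly. The sketch below codes the piecewise definition as stated; the function name and the value c = 2.0 are arbitrary choices for illustration and do not correspond to any particular breakdown point.

```python
def rho_biweight(u, c):
    """Biweight rho function: polynomial for |u| <= c, constant c^2/6 beyond."""
    u = abs(u)                         # rho is symmetric in u
    if u <= c:
        return u ** 2 / 2 - u ** 4 / (2 * c ** 2) + u ** 6 / (6 * c ** 4)
    return c ** 2 / 6


c = 2.0                                # illustrative tuning constant (an assumption)
values = [rho_biweight(0.25 * i, c) for i in range(17)]   # grid on [0, 4]
print(values[0], values[8], values[16])
```

On the grid the function starts at zero, increases up to u = c, and stays constant afterwards, in line with condition (R).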

The breakdown point of an estimator is the fraction of the data that can be 
completely contaminated without destroying the estimator. A breakdown point 
is a measure of robustness and resides between 0% and 50%. In this paper, we 
will use the MCD- and S-estimators with breakdown point 25%. The choice of a 
25% breakdown point gives a good compromise between efficiency and robustness 
of the estimators (see e.g., Croux and Haesbroeck, 1999). 

Another way to robustify the classical method is by detecting the influential
observations and deleting them from the samples. Measures to diagnose influential
observations in the context of discriminant analysis have been proposed by
Fung (1995, 1996). By deleting these influential observations from the sample and
using the classical discriminant analysis based on the remaining observations, the
classical method becomes robust. Note that for detecting these observations, it is
also advised to use robust estimates to avoid the masking effect. Indeed, it is well
known that diagnostics based on non-robust estimates do not always detect all
outliers.

As a measure of performance, we will look at the total probability of misclassification.
Formally, the total probability of misclassification (TPM) is given by

\[
TPM = \pi_1 P(Q(x) < \tau \mid x \sim p_1) + \pi_2 P(Q(x) > \tau \mid x \sim p_2).
\]

The total probability of misclassification according to a rule can be estimated by
classifying observations of which the source population is known and looking at the
fraction of misclassified observations. Under the normality assumption the probabilities
of misclassification for the optimal linear and quadratic discriminant rules
can be computed theoretically as a function of the population parameters of location
and dispersion. In the case of equal covariance matrices, the total probability of
misclassification in the linear case is given by

\[
TPM_{linear} = \Phi\left( -\frac{\Delta}{2} \right), \tag{2.1}
\]

where $\Phi$ is the cumulative standard normal distribution function and $\Delta$ is the
Mahalanobis distance between the populations, namely $\Delta = \sqrt{(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)}$.

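Since the discriminant rule in the linear case is of the form $L(x) = a^T x + b$, we get
in the case of populations with different covariance matrices

\[
\begin{aligned}
TPM_{linear} &= \pi_1 P(L(x) < 0 \mid x \in p_1) + \pi_2 P(L(x) > 0 \mid x \in p_2) \\
&= \pi_1 P(a^T x + b < 0 \mid x \sim N_p(\mu_1, \Sigma_1)) + \pi_2 P(a^T x + b > 0 \mid x \sim N_p(\mu_2, \Sigma_2)) \\
&= \pi_1 P\left( z < \frac{-b - a^T \mu_1}{\sqrt{a^T \Sigma_1 a}} \;\Big|\; z \sim N(0, 1) \right)
 + \pi_2 P\left( z > \frac{-b - a^T \mu_2}{\sqrt{a^T \Sigma_2 a}} \;\Big|\; z \sim N(0, 1) \right) \\
&= \pi_1 \Phi\left( \frac{-b - a^T \mu_1}{\sqrt{a^T \Sigma_1 a}} \right)
 + \pi_2 \left( 1 - \Phi\left( \frac{-b - a^T \mu_2}{\sqrt{a^T \Sigma_2 a}} \right) \right),
\end{aligned} \tag{2.2}
\]

where $\Phi$ is the standard normal cumulative distribution function and $a$ and $b$ are
as in (1.3), and where $\Sigma$ can be estimated by the pooled covariance matrix. A
theoretical expression of $TPM_{quadratic}$ can also be obtained, but it needs to be
evaluated numerically. For more details and expressions, we refer to Croux and
Joossens (2003).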
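Formula (2.2) is straightforward to evaluate for given coefficients. The sketch below is illustrative only: the function names are hypothetical, equal priors are assumed by default, and the example uses the equal-covariance configuration with means (-1, -1, -1) and (1, 1, 1) and identity covariance, for which rule (1.3) gives a = (-2, -2, -2) and b = 0, so that (2.2) reduces to (2.1).

```python
import math
from statistics import NormalDist


def dot(a, m):
    """Inner product of two vectors given as sequences."""
    return sum(ai * mi for ai, mi in zip(a, m))


def quad_form(a, S):
    """Quadratic form a^T S a for a matrix S given as nested sequences."""
    return sum(a[i] * S[i][j] * a[j] for i in range(len(a)) for j in range(len(a)))


def tpm_linear(a, b, mu1, S1, mu2, S2, pi1=0.5, pi2=0.5):
    """Total probability of misclassification of L(x) = a^T x + b, as in (2.2)."""
    phi = NormalDist().cdf
    z1 = (-b - dot(a, mu1)) / math.sqrt(quad_form(a, S1))
    z2 = (-b - dot(a, mu2)) / math.sqrt(quad_form(a, S2))
    return pi1 * phi(z1) + pi2 * (1.0 - phi(z2))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mu1, mu2 = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
a, b = [-2.0, -2.0, -2.0], 0.0         # rule (1.3) for this configuration
tpm = tpm_linear(a, b, mu1, identity, mu2, identity)
delta = math.sqrt(12.0)                # Mahalanobis distance between the means
print(100 * tpm, 100 * NormalDist().cdf(-delta / 2))
```

Both printed values are percentages and should coincide, since the covariance matrices are equal in this example.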



3. Simulation Experiment

We generate 1000 random normally distributed observations from each of two populations
$p_1 \sim N_p(\mu_1, \Sigma_1)$ and $p_2 \sim N_p(\mu_2, \Sigma_2)$ of dimension $p = 3$, which constitute the
training sample. First, we consider the uncontaminated samples and afterwards
we contaminate them by adding outliers. We will compute the classical and the
robust discriminant rules for both the linear and the quadratic case.

As already mentioned for linear discriminant analysis the discriminant rule
(1.3) is of the following form

L(x) = a^T x + b,

where $a$ is a $p$-dimensional vector and $b$ a scalar. In the case of quadratic discriminant
analysis the discriminant rule (1.1) is of the following form

Q(x) = x^T A x + d^T x + e,

where $A$ is a $(p \times p)$ matrix, $d$ is a $p$-dimensional vector and $e$ is a scalar.
These parameters need to be estimated, and this will be done classically and
robustly, using the MCD-estimators and S-estimators.

For the MCD-estimators we use the fastmcd algorithm by Rousseeuw and
Van Driessen (1999) and for the S-estimators we use the algorithm developed by
Ruppert (1992). For estimating the common covariance matrix $\Sigma = \Sigma_1 = \Sigma_2$ in the
linear case a pooled covariance matrix estimate is used. Programs for computing
robust linear and quadratic discriminant analysis can be retrieved from the web site
http://www.econ.kuleuven.ac.be/christophe.croux. Note that our primary interest
is not in the parameter estimates of the linear and quadratic rule, but only in the
probability of misclassification of the rules.

After constructing the discriminant rules we generate 5000 random normally
distributed observations for each population, without contamination. These
validation samples have the same distribution as the training samples (if we do
not take the outliers into account). These observations are classified by all the
discriminant rules. In linear discriminant analysis we assign an observation $x$ to
the first population if and only if $L(x) > 0$ and in quadratic discriminant analysis
if and only if $Q(x) > 0$, in line with (1.2) for $\tau = 0$. Since the source populations of these
observations are known, we are able to decide whether an observation is misclassified.
The fraction of misclassified observations is then an estimate of the total probability
of misclassification using the specific discriminant rule. By taking
5000 observations in the validation sample, we aim at attaining an accurate
estimate of the population misclassification rate. Indeed, for a given discriminant
rule, the standard error of this estimate is always less than 0.71%.
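The bound of 0.71% follows from the binomial standard error of a proportion, which is largest when the true misclassification probability equals 0.5. A minimal check, assuming 5000 independent validation observations and using a hypothetical function name:

```python
import math


def se_misclassification(p, n_validation):
    """Binomial standard error of an estimated misclassification proportion."""
    return math.sqrt(p * (1.0 - p) / n_validation)


worst_case = se_misclassification(0.5, 5000)     # p = 0.5 maximizes p(1 - p)
typical = se_misclassification(0.04, 5000)       # e.g. a TPM of about 4%
print(100 * worst_case, 100 * typical)
```

The first value printed is the worst-case standard error in percent; for smaller misclassification rates the standard error is lower.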

Since the classification rules depend strongly on the generated data of the 
training sample, we generate 500 different training data sets, on which we apply 
both linear and quadratic discriminant rules based on classical and robust (MCD- 
and S-) estimators. Working with 500 different training sets allows us to take the 
estimation variability of the discriminant rules into account. Indeed, we can com- 
pute the mean and the standard deviation of the total probability of misclassifica- 
tion over the 500 runs for linear and quadratic, classical and robust discriminant 
analysis. 



4. Simulation Results 

We report the total probability of misclassification in percent and put the associated
standard deviations in parentheses (also in percent). The theoretical
values of the TPM, in the absence of contamination, are also given.
Let us consider 3 cases and take rather extreme cases to illustrate the effect
of outliers. For simplicity of notation, $a$ stands for a vector $(a, a, a)^T$ and $I$ for
the 3-dimensional identity matrix. Note that similar simulation experiments, but
only for the linear case, were constructed in He and Fung (2000) and Croux and
Dehon (2001).
4.1. Unequal Means and Equal Covariance Matrices

The two populations have the same covariance matrix, but different means.
Because of the equality of the covariance matrices, the linear method should
be preferred because it is much easier than the quadratic and it should lead
(asymptotically) to the same rules. Note that in the case of equal covariance
matrices, the theoretical values of the total probability of misclassification for
the quadratic rule are the same as for the linear rule, computed as in (2.1).
The two populations consist of 1000 observations, where $p_1 \sim N_3(-1; I)$ and
$p_2 \sim N_3(1; I)$. Let us now contaminate 10% of the observations of each population,
deviating in location from the original distributions. The outliers of the first
population follow a $N_3(9; I)$ distribution and those of the second population a
$N_3(-9; I)$ distribution. As a second type of contamination, 10% of the data are
switched from one group to the other.

Table 1. Average TPM with standard deviation in parentheses
over 500 runs, in case of equal covariance matrices. Linear
and quadratic discriminant rules based on the classical, the MCD-
and the S-estimators are considered.

             Theoretical   Classic         MCD            S

Linear       4.16           4.09 (0.02)    4.13 (0.05)    4.09 (0.02)
Quadratic    4.16           4.10 (0.02)    4.16 (0.06)    4.10 (0.02)

in presence of 10% outliers in location of type 1

Linear                     49.95 (3.60)    4.12 (0.04)    4.09 (0.02)
Quadratic                  49.88 (2.85)    4.14 (0.05)    4.10 (0.02)

in presence of 10% outliers in location of type 2

Linear                      4.21 (0.04)    4.20 (0.03)    4.20 (0.02)
Quadratic                   4.24 (0.04)    4.23 (0.04)    4.21 (0.04)



From Table 1 it seems clear that the results coming from the linear discriminant
analysis are close to those of the quadratic discriminant analysis. When
the populations are contaminated by extreme outliers, the robust estimators are
obviously much better than the classical estimators. Note that in this case the
S-estimator behaves more robustly than the MCD-estimator, yielding lower values
for the simulated TPM. But, taking the standard errors into account, this difference
is not significant. For the second type of contamination all discriminant rules
perform equally well.

4.2. Equal Means and Unequal Covariance Matrices 

Let us now consider populations with the same mean, but with different covariance
matrices. For the computation of the theoretical values of the total probability
of misclassification in the linear case we use formula (2.2) and in the quadratic
case we use the formula proposed by Croux and Joossens (2003). For the linear
discriminant rule, the pooled sample covariance matrix is used.

Let $p_1$ and $p_2$ be two populations consisting each of 1000 observations following
a $N_3(0; 100I)$ and a $N_3(0; I)$ distribution, for respectively the first and the
second population. We again contaminate 10% of the observations, but now to get
outliers in dispersion. We generate them as if they came from the wrong population.
This means that 100 observations of the first population have $I$ as covariance matrix
instead of $100I$, and vice versa for the second population. This leads
to the following results.

Table 2. As in Table 1, but in case of equal means and unequal
covariance matrices.

             Theoretical   Classic         MCD            S

Linear       50.00         45.73 (1.89)    43.48 (2.66)   45.56 (1.89)
Quadratic     0.82          0.83 (0.01)     0.82 (0.18)    0.83 (0.01)

in presence of 10% outliers

Linear                     46.64 (2.21)    44.95 (2.16)   45.98 (1.70)
Quadratic                   7.24 (0.46)     1.75 (0.13)    1.11 (0.03)

In case of equal means and unequal covariance matrices we notice a signif- 
icant difference between the linear and the quadratic discriminant analysis. The 
quadratic method is much better, which is logical because the linear method as- 
sumes that the covariance matrices are equal. In case of contamination the robust 
method is again better than the classical and the S-estimator is better than the 
MCD-estimator. 

4.3. Unequal Means and Unequal Covariance Matrices 

Consider now two populations with different means and different covariance
matrices. Since this is the more representative case, we consider three different
sampling schemes with different degrees of overlap between the two populations.
This results in increasing values for the total probability of misclassification. From
a practical point of view, it means that Scheme 1 corresponds to "easy" classification
problems, while the last scheme is a more difficult one.

Scheme 1: Let $p_1 \sim N_3(-1; I)$ and $p_2 \sim N_3(1; 0.25I)$ be two populations consisting
of 1000 observations each. First we contaminate the samples, by creating
10% outliers in location such that the classical mean estimators should become
close: 100 observations of $p_1$ follow a $N_3(9; I)$ distribution and 100 observations
of $p_2$ follow a $N_3(-9; 0.25I)$ distribution. Secondly, we change these outliers, such
that they deviate in location and dispersion. This is done by changing their corresponding
covariance matrices $I$ and $0.25I$ into $0.25I$ and $I$, for respectively the
outliers in the first and the second population.

Scheme 2: Let $p_1 \sim N_3(0; 2.25I)$ and $p_2 \sim N_3(1; 0.25I)$. We contaminate
100 observations in each of those two populations such that they deviate in location
and dispersion. The outliers of the first population follow a $N_3(3; 9I)$ distribution
and those of the second population follow a $N_3(-1; I)$ distribution.

Scheme 3: Let us consider two populations, $p_1 \sim N_3(0; 4I)$ and $p_2 \sim N_3(1; 16I)$,
consisting each of 1000 observations. The first type of outliers is generated by
replacing 100 observations of $p_1$ by observations coming from a $N_3(4; I)$ distribution
and 100 from $p_2$ by observations coming from a $N_3(-16; I)$ distribution.
A second type of outliers is generated by replacing 100 observations from each
population as if they came from the wrong population. In other words,
the first population consists of 900 observations coming from a $N_3(0; 4I)$ distribution
and 100 coming from a $N_3(1; 16I)$ distribution and vice versa for the second
population.

Table 3. Average TPM with standard deviation in parentheses
over 500 runs, in the most general case of unequal means
and unequal covariance matrices. Linear and quadratic discriminant
rules based on the classical, the MCD- and the S-estimators
are considered. Three different sampling schemes are considered.

SCHEME 1     Theoretical   Classic         MCD            S

Linear        1.92          2.05 (0.08)     2.09 (0.13)    2.05 (0.08)
Quadratic     0.74          0.75 (0.08)     0.86 (0.08)    0.75 (0.09)

in presence of 10% outliers in location

Linear                     49.89 (3.61)     2.13 (0.11)    2.12 (0.08)
Quadratic                  26.44 (0.06)     0.80 (0.04)    0.87 (0.04)

in presence of 10% outliers in location and dispersion

Linear                     49.72 (3.60)     2.05 (0.11)    2.04 (0.08)
Quadratic                  26.47 (0.12)     0.78 (0.05)    0.83 (0.04)

SCHEME 2     Theoretical   Classic         MCD            S

Linear       18.51         16.20 (0.06)    16.29 (0.14)   16.20 (0.06)
Quadratic     6.82          6.87 (0.03)     8.80 (0.08)    6.87 (0.03)

in presence of 10% outliers in location and dispersion

Linear                     16.65 (0.45)    16.29 (0.10)   16.23 (0.01)
Quadratic                  13.04 (0.26)     8.00 (0.24)    8.27 (0.15)

SCHEME 3     Theoretical   Classic         MCD            S

Linear       38.82         37.29 (0.31)    37.40 (0.48)   37.29 (0.32)
Quadratic    20.67         20.30 (0.05)    23.40 (0.34)   20.31 (0.05)

in presence of 10% outliers of type 1

Linear                     56.05 (0.82)    37.48 (0.44)   39.82 (0.63)
Quadratic                  23.74 (0.19)    21.85 (0.23)   23.03 (0.17)

in presence of 10% outliers of type 2

Linear                     37.86 (0.33)    38.20 (0.33)   38.12 (0.35)
Quadratic                  20.73 (0.11)    20.42 (0.10)   20.36 (0.08)


In all contaminated cases (see Table 3), the quadratic rule outperforms the linear
one. In the presence of outliers, classification based on robust estimates is much
better than classification based on classical estimates. (Note that for Scheme 3,
where the outliers are less extreme than in the previous contamination schemes, the
classical quadratic rule does not lose much.) In the first scheme, for the linear
case in the second scheme, and for the second type of outliers in the third scheme,
the robust discriminant analysis based on the S-estimator is slightly, though not
significantly, better than the one based on the MCD-estimator in the presence of
outliers (linear and quadratic). For the first type of outliers in the third scheme
and for the quadratic case in the second scheme, however, the discriminant rules
using the MCD-estimator are slightly better than those using the S-estimator in the
presence of outliers.

If no outliers are present, the S-estimator yields lower misclassification rates
than the MCD-estimator. From these simulations in the case of unequal means and
covariance matrices, we can conclude that quadratic discriminant analysis is always
preferred to linear discriminant analysis and that the robust method is better than
the classical method. Sometimes the robust discriminant analysis based on the
MCD-estimator is better than the one based on the S-estimator, and sometimes it is
the other way round. Therefore we cannot say which of the robust estimators is to
be preferred.



5. Conclusions 

It is shown that quadratic discriminant analysis is needed when the covariance
matrices of the populations are different. Robust estimators are needed if there are
outliers in the populations. If no outliers are present in the training sample, the
robust procedure loses practically nothing in classification. Therefore, the only
major loss when using robust discriminant rules is computation cost. Note, however,
that fast algorithms have been used and that computer software is available to
compute the robust estimators considered in this paper. The choice of MCD-
and S-estimators was motivated by this computational aspect and by their high
breakdown point.

While this proceedings paper focuses on simulation aspects, a more formal
and theoretical treatment is provided in Croux and Joossens (2003), where the
influence of outliers on the quadratic discriminant rule and its total probability of
misclassification is studied.



Estimates of the Tail Index Based
on Nonparametric Tests

J. Jureckova and J. Picek 



Abstract. The authors recently constructed several nonparametric tests of
one-sided hypotheses on the value of the Pareto-type tail index m in the family
of distributions with nondegenerate right tails. Inverting the tests in the
Hodges-Lehmann manner (Hodges and Lehmann, 1963), we obtain strongly
consistent estimators of m. The simulation study demonstrates surprisingly
good approximations of m, particularly by two of the three proposed estimators.

Keywords. Pareto-type tail index; strong approximation of empirical process.



1. Introduction 

The authors recently constructed several nonparametric tests of one-sided hypotheses
on the value of the Pareto-type tail index in the family of distributions with
nondegenerate right tails; see Fialova et al. (2004), Jureckova (2000), Jureckova
and Picek (2001) and Picek and Jureckova (2001). The proposed tests, fully
nonparametric, are based on splitting the set of observations into N subsamples of
size n and on the empirical distribution functions of the N subsample statistics of
various types; namely, the subsample extremes in Jureckova and Picek (2001),
the subsample means in Fialova et al. (2004), the averages of two subsubsample
extremes in Picek and Jureckova (2001), and the linear combinations of extreme
regression quantiles in the linear regression model (see Jureckova, 2000).

The model assumes that the distribution function $F$ of the observations
satisfies $1 - F(x) = x^{-m} L(x)$, where $m > 0$ is the parameter of interest and $L(x)$ is
a function slowly varying at infinity. As such, the problem is semiparametric in
nature, involving an unknown slowly varying function. The tests of the one-sided
hypothesis $H_0 : m \le m_0$ (or $H_1 : m \ge m_0$) are consistent against one-sided
alternatives $m > m_0$ (or $m < m_0$, respectively) as $N \to \infty$ while $n$ remains fixed;
the asymptotic null distributions of the test criteria are normal. The simulation
studies in the cited papers show that the tests distinguish the distribution
tails well already for moderate samples.

Research of the first author was supported by the Czech Republic Grant 201/02/0621 and by the
Project CEZ: MSM 113200008 “Mathematical Methods in Stochastics”. Research of the second author
was supported by the Czech Republic Grant 205/02/0871 and by a Postdoctoral Grant from the
Portuguese Fundação para a Ciência e a Tecnologia (FCT) at the University of Lisbon (Centro de
Estatística e Aplicações).

Inverting such a test in the Hodges-Lehmann manner (Hodges and Lehmann,
1963), we obtain a strongly consistent estimate $M_N$ of $m$. The simulation study
demonstrates a surprisingly good approximation of $m$ by $M_N$, mainly by the
estimators based on the tests proposed in Jureckova and Picek (2001) and Picek and
Jureckova (2001).

2. Estimator of m Based on Subsample Extremes 

Let us have a dataset of $nN$ independent observations, identically distributed with
distribution function $F$ satisfying

$$1 - F(x) = x^{-m} L(x), \qquad (2.1)$$

where $m$, $0 < m < \infty$, is the parameter of interest and $L(x)$ is a function
slowly varying at infinity. To avoid a possible effect of the ordering of the
observations, the data may be randomly permuted before the inference starts. The
observations are then partitioned into $N$ samples $X_j = (X_{j1}, \ldots, X_{jn})'$ of equal
fixed size $n$, $j = 1, \ldots, N$. Denote by $X_{(n)}^{(1)}, \ldots, X_{(n)}^{(N)}$ the respective sample
maxima. Then $X_{(n)}^{(1)}, \ldots, X_{(n)}^{(N)}$ is a random sample from the distribution
$F^*(x) = F^n(x)$. Let $F_N^*$ be the empirical distribution function of the sample
maxima, i.e.,

$$F_N^*(x) = \frac{1}{N} \sum_{j=1}^{N} I[X_{(n)}^{(j)} \le x]. \qquad (2.2)$$

For any fixed $m > 0$, put

$$a_{N,m} = (n N^{1-\delta})^{1/m}, \qquad (2.3)$$

where $0 < \delta < \frac{1}{2}$ is a chosen constant; then $\lim_{N \to \infty} a_{N,m} = \infty$.

The test of $H_{m_0} : m \le m_0$ against $K_{m_0} : m > m_0$, proposed in Jureckova
and Picek (2001), rejects the hypothesis at the asymptotic significance level
$\alpha \in (0, 1)$ provided

either $1 - F_N^*(a_{N,m_0}) = 0$,

or $1 - F_N^*(a_{N,m_0}) > 0$ and (2.4)

$N^{\delta/2} [-\log(1 - F_N^*(a_{N,m_0})) - (1 - \delta) \log N] \ge \Phi^{-1}(1 - \alpha)$,

where $\Phi$ is the standard normal distribution function.

As was proved in Jureckova and Picek (2001),

$$\lim_{N \to \infty} P_{m_0}\bigl(0 < F_N^*(a_{N,m_0}) < 1\bigr) = 1 \qquad (2.5)$$

for every $F$ satisfying $1 - F(x) \ge x^{-m_0}$ for all $x > x_0$. Moreover, if $F$ has exactly
the Pareto tail corresponding to $1 - F(x) = x^{-m_0}$ for all $x > x_0$, then the left-hand
side of the inequality in (2.4) is asymptotically normally distributed as $N \to \infty$.

If $F$ satisfies $1 - F(x) \ge x^{-m}$ for all $x > x_0$ for some $m \le m_0$, then

$$\limsup_{N \to \infty} P_{m_0}\bigl\{N^{\delta/2} [-\log(1 - F_N^*(a_{N,m_0})) - (1 - \delta) \log N] \ge \Phi^{-1}(1 - \alpha)\bigr\} \le \alpha.$$

Analogously, we reject the hypothesis $m \ge m_0$ against $m < m_0$
provided

either $F_N^*(a_{N,m_0}) = 0$,

or $F_N^*(a_{N,m_0}) > 0$ and (2.6)

$N^{\delta/2} [-\log(1 - F_N^*(a_{N,m_0})) - (1 - \delta) \log N] \le \Phi^{-1}(\alpha)$.

The last inequality is equivalent to

$$N^{\delta/2} [\log(1 - F_N^*(a_{N,m_0})) + (1 - \delta) \log N] \ge \Phi^{-1}(1 - \alpha). \qquad (2.7)$$

The quantity $a_{N,m}$ defined in (2.3) is decreasing in $m$ for any fixed $N$, $n$; hence
the statistic

$$\log(1 - F_N^*(a_{N,m})) + (1 - \delta) \log N$$

is nondecreasing in $m$ for fixed $N$, $n$. Because it is discontinuous, the equation

$$\log(1 - F_N^*(a_{N,m})) + (1 - \delta) \log N = 0$$

generally has no solution with respect to $m$, and we should rather look for $m$
satisfying the approximate equation

$$1 - F_N^*(a_{N,m}) \approx N^{-(1-\delta)}.$$

Following these ideas, we define an estimator of $m$ as

$$M_N = \tfrac{1}{2} (M_N^+ + M_N^-), \qquad (2.8)$$

where

$$M_N^- = \sup\{s : 1 - F_N^*(a_{N,s}) < N^{-(1-\delta)}\},$$
$$M_N^+ = \inf\{s : 1 - F_N^*(a_{N,s}) > N^{-(1-\delta)}\}. \qquad (2.9)$$

In this way we obtain a strongly consistent estimator of $m$ ($> 0$), as is shown in
the following theorem:



Theorem 2.1. If $F$ has the right tail with tail exponent $m > 0$, i.e., $1 - F(x) =
x^{-m} L(x)$ where $L(\cdot)$ is slowly varying at $\infty$, then

$$M_N \to m \quad \text{a.s. as } N \to \infty. \qquad (2.10)$$

Proof. Because the relative increase or decrease of $L(x)$ is smaller than any power
of $x$, it is sufficient to prove (2.10) for $F$ with an exact Pareto right tail with index
$m$, i.e., for $F$ satisfying $1 - F(x) = x^{-m}$ for $x > x_0$. The relations (2.8) and (2.9)
imply that, for any $\varepsilon > 0$,

$$P_m(|M_N - m| > \varepsilon) \le P_m\{M_N^+ > m + \varepsilon\} + P_m\{M_N^- < m - \varepsilon\}$$
$$\le P_m\{1 - F_N^*(a_{N,m+\varepsilon}) < N^{-(1-\delta)}\} + P_m\{1 - F_N^*(a_{N,m-\varepsilon}) > N^{-(1-\delta)}\}. \qquad (2.11)$$



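Denote by $A_{N,\varepsilon}$ the event

$$A_{N,\varepsilon} = \bigl\{1 - F_N^*(a_{N,m+\varepsilon}) < N^{-(1-\delta)}\bigr\}. \qquad (2.12)$$

Then, for an integer $K > 0$,

$$P_m\Bigl\{\bigcap_{k=K}^{\infty} \bigcup_{N=k}^{\infty} A_{N,\varepsilon}\Bigr\}
= P_m\Bigl\{\frac{F_N^*(a_{N,m+\varepsilon}) - F^*(a_{N,m+\varepsilon})}{1 - F^*(a_{N,m+\varepsilon})} > 1 - \frac{N^{-(1-\delta)}}{1 - F^*(a_{N,m+\varepsilon})} \ \text{i.o. for } N > K\Bigr\}. \qquad (2.13)$$

Under the exact Pareto tail, $1 - F^*(a_{N,m+\varepsilon})$ is of order
$n^{\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{m}{m+\varepsilon}}$, so that the right-hand side of the inequality in (2.13)
behaves like $1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}$.

By Komlos et al. (1975), there exists a Brownian bridge $B_N$ depending on
$X_{(n)}^{(1)}, \ldots, X_{(n)}^{(N)}$ such that

$$N^{1/2} [F_N^*(a_{N,m+\varepsilon}) - F^*(a_{N,m+\varepsilon})] - B_N(F^*(a_{N,m+\varepsilon})) = O(N^{-1/2} \log N) \quad \text{a.s.} \qquad (2.14)$$

as $N \to \infty$. Hence, by (2.13) and (2.14),

$$P_m\Bigl\{\bigcap_{k=K}^{\infty} \bigcup_{N=k}^{\infty} A_{N,\varepsilon}\Bigr\}
= P_m\Bigl\{\frac{N^{-1/2} B_N(F^*(a_{N,m+\varepsilon})) + O_{a.s.}(N^{-1} \log N)}{1 - F^*(a_{N,m+\varepsilon})} > 1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}} \ \text{i.o. for } N > K\Bigr\}$$
$$\le P_m\Bigl\{\frac{N^{-1/2} B_N(F^*(a_{N,m+\varepsilon}))}{1 - F^*(a_{N,m+\varepsilon})} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \ \text{i.o. for } N > K\Bigr\}$$
$$+ P_m\Bigl\{O_{a.s.}\bigl(N^{-1+(1-\delta)\frac{m}{m+\varepsilon}} \log N\bigr) > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \ \text{i.o. for } N > K\Bigr\}. \qquad (2.15)$$

If $W_N = O_{a.s.}\bigl(N^{-1+(1-\delta)\frac{m}{m+\varepsilon}} \log N\bigr)$, then, for any $\eta > 0$, there exist a constant
$C(\eta) > 0$ and an integer $K(\eta)$ such that

$$P_m\bigl\{W_N > C(\eta) N^{-1+(1-\delta)\frac{m}{m+\varepsilon}} \log N \ \text{i.o. for } N > K\bigr\} < \eta \quad \text{for } K > K(\eta),$$

hence,

$$P_m\bigl\{W_N > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \ \text{i.o. for } N > K\bigr\} < \eta \quad \text{for } K > K(\eta).$$

On the other hand,

$$P_m\Bigl\{\frac{N^{-1/2} B_N(F^*(a_{N,m+\varepsilon}))}{1 - F^*(a_{N,m+\varepsilon})} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \ \text{i.o. for } N > K\Bigr\}$$
$$= P_m\bigl\{Z_{N,\varepsilon} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \ \text{i.o. for } N > K\bigr\}$$
$$\le \sum_{N=K}^{\infty} P_m\bigl\{Z_{N,\varepsilon} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr)\bigr\}, \qquad (2.16)$$

where $Z_{N,\varepsilon}$ is normal, with mean zero and variance
$F^*(a_{N,m+\varepsilon}) / [N (1 - F^*(a_{N,m+\varepsilon}))]$. Hence, for sufficiently large $N$,

$$P_m\bigl\{Z_{N,\varepsilon} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr)\bigr\}
= 1 - \Phi\Bigl(\frac{\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr) \bigl[N (1 - F^*(a_{N,m+\varepsilon}))\bigr]^{1/2}}{2 \bigl[F^*(a_{N,m+\varepsilon})\bigr]^{1/2}}\Bigr)$$
$$\le 1 - \Phi\Bigl(\tfrac{1}{4}\, n^{\frac{\varepsilon}{2(m+\varepsilon)}} N^{\frac{\varepsilon + \delta m}{2(m+\varepsilon)}}\Bigr)
\le \frac{4}{\sqrt{2\pi}}\, n^{-\frac{\varepsilon}{2(m+\varepsilon)}} N^{-\frac{\varepsilon + \delta m}{2(m+\varepsilon)}} \exp\Bigl\{-\tfrac{1}{32}\, n^{\frac{\varepsilon}{m+\varepsilon}} N^{\frac{\varepsilon + \delta m}{m+\varepsilon}}\Bigr\}, \qquad (2.17)$$

where we have used the inequality $1 - \Phi(x) \le \frac{1}{x\sqrt{2\pi}} e^{-x^2/2}$, $x > 0$. It follows from
(2.17) that, for any $\eta > 0$, there exists an integer $K(\eta)$ such that

$$\sum_{N=K(\eta)}^{\infty} P_m\bigl\{Z_{N,\varepsilon} > \tfrac{1}{2}\bigl(1 - n^{-\frac{\varepsilon}{m+\varepsilon}} N^{-(1-\delta)\frac{\varepsilon}{m+\varepsilon}}\bigr)\bigr\} < \eta. \qquad (2.18)$$

It then follows from (2.13)-(2.18) that, for any $\eta > 0$, there exists an integer $K(\eta)$
such that

$$P_m\Bigl\{\bigcap_{k=K}^{\infty} \bigcup_{N=k}^{\infty} A_{N,\varepsilon}\Bigr\} < 2\eta \quad \text{for } K > K(\eta). \qquad (2.19)$$

Similarly, denoting

$$A_{N,-\varepsilon} = \bigl\{1 - F_N^*(a_{N,m-\varepsilon}) > N^{-(1-\delta)}\bigr\},$$

we can prove that, for any $\eta > 0$, there exists an integer $K(\eta)$ such that

$$P_m\Bigl\{\bigcap_{k=K}^{\infty} \bigcup_{N=k}^{\infty} A_{N,-\varepsilon}\Bigr\} < 2\eta \quad \text{for } K > K(\eta), \qquad (2.20)$$

and (2.11), (2.19) and (2.20) imply that $M_N \to m$ a.s. as $N \to \infty$.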
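
The estimator of this section is straightforward to compute, because $1 - F_N^*(a_{N,s})$ is a step function of $s$. The following minimal sketch is an illustration only: it assumes that the threshold in (2.3) is $a_{N,m} = (nN^{1-\delta})^{1/m}$, and the function names are hypothetical. A maximum $x > 1$ exceeds $a_{N,s}$ exactly when $s > \log(nN^{1-\delta}) / \log x$, so $M_N^-$ and $M_N^+$ are order statistics of these crossing points.

```python
import math
import random


def tail_index_from_maxima(data, n, delta):
    """Sketch of M_N = (M_N^+ + M_N^-)/2 based on subsample maxima.

    Assumption: a_{N,s} = (n * N**(1 - delta)) ** (1 / s).
    """
    N = len(data) // n
    maxima = [max(data[j * n:(j + 1) * n]) for j in range(N)]
    log_c = math.log(n * N ** (1.0 - delta))
    # Crossing point of each maximum: it exceeds a_{N,s} iff s > log_c / log(x).
    # Maxima not larger than 1 never exceed the threshold.
    crossings = sorted(log_c / math.log(x) if x > 1.0 else math.inf
                       for x in maxima)
    target = N ** delta                # N * N**(-(1 - delta))
    k_minus = math.ceil(target) - 1    # largest integer strictly below target
    k_plus = math.floor(target) + 1    # smallest integer strictly above target
    m_minus = crossings[k_minus]       # sup{s : #exceedances < target}
    m_plus = crossings[k_plus - 1]     # inf{s : #exceedances > target}
    return 0.5 * (m_minus + m_plus)


def pareto_sample(rng, m, size):
    """Exact Pareto sample with 1 - F(x) = x**(-m), x >= 1 (inverse transform)."""
    return [(1.0 - rng.random()) ** (-1.0 / m) for _ in range(size)]
```

For example, `tail_index_from_maxima(pareto_sample(random.Random(0), 1.0, 1000), n=5, delta=0.49)` mimics one replication of the setting N = 200, n = 5 used in Section 5.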



3. Estimator of m Based on Subsample Means 

Similarly as in Section 2, let $X_j = (X_{j1}, \ldots, X_{jn})'$, $j = 1, \ldots, N$, be $N$ independent
samples from the population with distribution function $F$ satisfying (2.1). Let
$\bar{X}_n^{(1)}, \ldots, \bar{X}_n^{(N)}$ be the respective sample means. Then $\bar{X}_n^{(1)}, \ldots, \bar{X}_n^{(N)}$ is a random
sample from a distribution $\bar{F}(x)$. Denote by $\bar{F}_N$ the empirical distribution function
of the sample means, i.e.,

$$\bar{F}_N(x) = \frac{1}{N} \sum_{j=1}^{N} I[\bar{X}_n^{(j)} \le x]. \qquad (3.1)$$

Let 

= 0<(5< 1. (3.2) 

The authors in Fialova et al. (2004) proposed a test rejecting $H_{m_0} : m \le m_0$ if
either $\bar{F}_N(a_{N,m_0}) = 1$,
or if $\bar{F}_N(a_{N,m_0}) < 1$ and simultaneously

$$N^{\delta/2} [-\log(1 - \bar{F}_N(a_{N,m_0})) - (1 - \delta) \log N] \ge \Phi^{-1}(1 - \alpha), \qquad (3.3)$$

where $\Phi$ is the standard normal distribution function. It was proved in Fialova et
al. (2004) that

$$\lim_{N \to \infty} P_{m_0}\bigl(0 < \bar{F}_N(a_{N,m_0}) < 1\bigr) = 1 \qquad (3.4)$$

for every $F$ satisfying $1 - F(x) \ge x^{-m_0}$ for all $x > x_0$, and if $F$ has exactly the
Pareto tail corresponding to $1 - F(x) = x^{-m_0}$ for all $x > x_0$, then

$$\lim_{N \to \infty} P_{m_0}\bigl\{N^{\delta/2} [-\log(1 - \bar{F}_N(a_{N,m_0})) - (1 - \delta) \log N] \le x\bigr\} = \Phi(x), \quad x \in R.$$

Analogously as in (2.8), we propose an estimator $M_N$ of $m$ of the following form:

$$M_N = \tfrac{1}{2} (M_N^+ + M_N^-), \qquad (3.5)$$

where

$$M_N^- = \sup\{s : 1 - \bar{F}_N(a_{N,s}) < N^{-(1-\delta)}\},$$
$$M_N^+ = \inf\{s : 1 - \bar{F}_N(a_{N,s}) > N^{-(1-\delta)}\}. \qquad (3.6)$$

$M_N$ defined in (3.5) and (3.6) is a strongly consistent estimator of $m$ ($> 0$).
The proof is quite analogous to the proof of Theorem 2.1; it uses the fact that,
provided $F$ satisfies (2.1), the distribution function of the sample means satisfies
$1 - \bar{F}(x) = x^{-m} L^*(x)$, $x \in R$, with $L^*$ being a slowly varying function (see
Fialova et al., 2004).



4. Estimator of m Based on Other Subsample Characteristics 

Observe $N$ independent samples $X_j = (X_{j1}, \ldots, X_{jn})'$ of fixed size $n$, $j =
1, \ldots, N$. Fix $\nu$, $1 \le \nu \le n - 1$, and denote

$$\theta_j = \tfrac{1}{2} (X_j^{(1)} + X_j^{(2)}), \qquad (4.1)$$

where

$$X_j^{(1)} = \max\{X_{j1}, \ldots, X_{j\nu}\}, \quad X_j^{(2)} = \max\{X_{j,\nu+1}, \ldots, X_{jn}\}, \quad j = 1, \ldots, N.$$

Then $(\theta_1, \ldots, \theta_N)$ is a random sample from a distribution $F^{**}$ (say); denote by $F_N^{**}$ the
corresponding empirical distribution function. Put

0 < J < i, m > 0. (4.2) 

The authors proposed in Picek and Jureckova (2001) the test of $H_0 : m \le m_0$
with the critical region

$\{(X_1, \ldots, X_N)$ : either $1 - F_N^{**}(a_{N,m_0}) = 0$

or $1 - F_N^{**}(a_{N,m_0}) > 0$ and (4.3)

$N^{\delta/2} [-\log(1 - F_N^{**}(a_{N,m_0})) - (1 - \delta) \log N] \ge \Phi^{-1}(1 - \alpha)\}$,

and, if $F$ has the exact Pareto tail with index $m_0$,

$$\lim_{N \to \infty} P_{m_0}\bigl\{N^{\delta/2} [-\log(1 - F_N^{**}(a_{N,m_0})) - (1 - \delta) \log N] \le x\bigr\} = \Phi(x), \quad x \in R.$$

We propose an estimator $M_N$ of $m$ in the following way:

$$M_N = \tfrac{1}{2} (M_N^+ + M_N^-), \qquad (4.4)$$

where

$$M_N^- = \sup\{s : 1 - F_N^{**}(a_{N,s}) < N^{-(1-\delta)}\},$$
$$M_N^+ = \inf\{s : 1 - F_N^{**}(a_{N,s}) > N^{-(1-\delta)}\}.$$

Then $M_N$ is a strongly consistent estimator of $m$ ($> 0$); the proof goes along the
lines of that of Theorem 2.1 of Section 2.



5. Numerical Illustration

The performance of the proposed estimation procedures is illustrated on simulated
random samples of sizes 1000 and 200, split into N = 200 subsamples of
size n = 5 and N = 50 subsamples of size n = 4, respectively; the samples were
generated 1000 times from the following distributions:



Pareto 


F{x) = 1 - 


Burr 


F{x) = 1 - 




a: > 0 
a: > 0 



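Generalized Pareto

$$F(x) = \begin{cases}
1 - \bigl(1 + \frac{x}{m\beta}\bigr)^{-m} & \text{if } x \ge 0,\ 0 < m < \infty,\ \beta > 0, \\
1 - \bigl(1 + \frac{x}{m\beta}\bigr)^{-m} & \text{if } 0 \le x \le -m\beta,\ m < 0,\ \beta > 0, \\
1 - e^{-x/\beta} & \text{if } x \ge 0,\ m = \infty,\ \beta > 0, \\
0 & \text{otherwise.}
\end{cases}$$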



For each of these distributions we proceeded as follows:

(1) generated independent observations $X_j = (X_{j1}, \ldots, X_{jn})'$,
$j = 1, \ldots, N$;

(2) computed

A) maxima $X_{(n)}^{(j)} = \max(X_{j1}, \ldots, X_{jn})$, $j = 1, \ldots, N$;

B) sample means $\bar{X}_n^{(1)}, \ldots, \bar{X}_n^{(N)}$;

C) means of two maxima

$\theta_j = \tfrac{1}{2} \bigl(\max(X_{j1}, \ldots, X_{j,[n/2]}) + \max(X_{j,[n/2]+1}, \ldots, X_{jn})\bigr)$,

$j = 1, \ldots, N$;

(3) found the corresponding empirical distribution function for each of the
quantities A-C;

(4) computed the estimators $M_N$ of Sections 2, 3 and 4, corresponding to A-C,
respectively, for $\delta = 0.01, 0.02, \ldots, 0.49$;

(5) for comparison, computed the Hill estimator

$$\Bigl[\frac{1}{k} \sum_{i=1}^{k} \log X_{(Nn-i+1)} - \log X_{(Nn-k)}\Bigr]^{-1},$$

where $X_{(1)} \le \ldots \le X_{(Nn)}$ are the order statistics of the whole sample,
for $k = 1, \ldots, 999$;

(6) repeated steps (1)-(5) 1000 times;

(7) computed and tabulated selected sample statistics of the resulting estimates.

The tables show that the nonparametric estimators based on the tests are
comparable with the Hill estimator, the most popular among the various
estimators of the tail index. While the Hill estimator is better under exact
Pareto tails, as can be expected, the nonparametric estimators perform better
when the algebraic tail is contaminated by a slowly varying function,
mainly the estimators based on the subsample maxima (A) and on the means of
two maxima (C); a more extensive simulation study, as well as a study of the
asymptotic bias and variance, is in progress.





Figure 1. Mean values of the estimators $M_N$ of Section 2 (solid), Section 3
(dotted) and Section 4 (dashed) plotted against $\delta$ (left) and of the Hill
estimator plotted against $k$ (right); the true distribution is Burr with $m = 1$,
$\alpha = 1$; $N = 200$, $n = 5$.



Figure 2. MSE values of the estimators $M_N$ of Section 2 (solid), Section 3
(dotted) and Section 4 (dashed) plotted against $\delta$ (left) and of the Hill
estimator plotted against $k$ (right); the true distribution is Burr with $m = 1$,
$\alpha = 1$; $N = 200$, $n = 5$.

The question is the choice of $k$ for the Hill estimator and of $\delta$ for our procedures. To
compare our estimators with the Hill estimator, we followed the standard approach
of minimizing the mean squared error (MSE); the following tables (Table 1 and
Table 2) give selected sample statistics of the estimators of $m$ for various
distribution shapes. Figures 1-2 show the behavior of the respective means and
MSEs of the estimators plotted against the choice of $\delta$ (or of $k$, respectively).

Table 1. Sample statistics of the estimates of the Pareto index
under minimal MSE for various distributions; N = 200 and n = 5.
The minimal MSE is denoted by *.

sample                   method  fraction    MSE       Mean     Median   Var      MAD

Pareto                   Hill    k = 999     *0.0003   0.5002   0.4992   0.0003   0.0160
m = 0.5                  A       δ = 0.49     0.0011   0.5093   0.5077   0.0010   0.0317
                         B       δ = 0.01     0.0047   0.4719   0.4719   0.0039   0.0624
                         C       δ = 0.48     0.0024   0.5340   0.5339   0.0012   0.0353

Pareto                   Hill    k = 999     *0.0010   1.0003   0.9984   0.0010   0.0321
m = 1                    A       δ = 0.49     0.0043   1.0188   1.0157   0.0040   0.0634
                         B       δ = 0.44     0.0076   0.9642   0.9641   0.0063   0.0778
                         C       δ = 0.49     0.0058   1.0435   1.0421   0.0039   0.0643

Pareto                   Hill    k = 999     *0.0091   3.0010   2.9951   0.0091   0.0962
m = 3                    A       δ = 0.49     0.0392   3.0570   3.0476   0.0360   0.1904
                         B       δ = 0.49     0.7612   3.8392   3.8346   0.0569   0.2347
                         C       δ = 0.12     0.0566   2.8928   2.8957   0.0452   0.2127

Burr                     Hill    k = 113      0.0025   0.4758   0.4744   0.0019   0.0442
α = 1                    A       δ = 0.49    *0.0012   0.5111   0.5095   0.0010   0.0324
m = 0.5                  B       δ = 0.01     0.0047   0.4721   0.4720   0.0039   0.0626
                         C       δ = 0.43     0.0026   0.5348   0.5349   0.0014   0.0392

Burr                     Hill    k = 113      0.0098   0.9517   0.9489   0.0075   0.0885
α = 1                    A       δ = 0.49    *0.0047   1.0226   1.0192   0.0042   0.0647
m = 1                    B       δ = 0.45     0.0077   0.9770   0.9733   0.0072   0.0827
                         C       δ = 0.49     0.0071   1.0528   1.0509   0.0043   0.0678

Burr                     Hill    k = 113      0.0884   2.8551   2.8466   0.0675   0.2655
α = 1                    A       δ = 0.49    *0.0423   3.0682   3.0580   0.0377   0.1942
m = 3                    B       δ = 0.49     2.6325   4.5726   4.5460   0.1597   0.3832
                         C       δ = 0.14     0.0555   2.9051   2.9040   0.0465   0.2177

Generalized              Hill    k = 311     *0.0010   0.4847   0.4841   0.0007   0.0261
Pareto                   A       δ = 0.45     0.0042   0.5519   0.5491   0.0015   0.0387
m = 0.5                  B       δ = 0.26     0.0036   0.4661   0.4639   0.0025   0.0496
                         C       δ = 0.38     0.0080   0.5768   0.5751   0.0021   0.0455

Generalized              Hill    k = 113      0.0098   0.9517   0.9489   0.0075   0.0885
Pareto                   A       δ = 0.49    *0.0047   1.0226   1.0192   0.0042   0.0647
m = 1                    B       δ = 0.45     0.0077   0.9770   0.9733   0.0072   0.0827
β = 1                    C       δ = 0.49     0.0071   1.0528   1.0509   0.0043   0.0678

Generalized              Hill    k = 24       0.5527   2.4329   2.3598   0.2314   0.4397
Pareto                   A       δ = 0.01     0.6412   2.2253   2.2293   0.0411   0.2023
m = 3                    B       δ = 0.11    *0.1166   2.8500   2.8384   0.0942   0.3179
β = 1                    C       δ = 0.01     0.6844   2.1925   2.1974   0.0324   0.1767

Table 2. Sample statistics of the estimates of the Pareto index
under minimal MSE for various distributions; N = 50 and n = 4.
The minimal MSE is denoted by *.

sample                   method  fraction    MSE       Mean     Median   Var      MAD

Pareto                   Hill    k = 199     *0.0013   0.5023   0.4998   0.0013   0.0359
m = 0.5                  A       δ = 0.49     0.0036   0.5210   0.5204   0.0032   0.0589
                         B       δ = 0.01     0.0084   0.4772   0.4746   0.0078   0.0855
                         C       δ = 0.49     0.0031   0.5151   0.5148   0.0029   0.0565

Pareto                   Hill    k = 199     *0.0050   1.0047   0.9995   0.0050   0.0718
m = 1                    A       δ = 0.49     0.0147   1.0435   1.0427   0.0128   0.1173
                         B       δ = 0.44     0.0204   0.9356   0.9337   0.0163   0.1307
                         C       δ = 0.49     0.0086   0.9851   0.9839   0.0084   0.0977

Pareto                   Hill    k = 199     *0.0453   3.0140   2.9985   0.0452   0.2153
m = 3                    A       δ = 0.49     0.1328   3.1337   3.1302   0.1150   0.3477
                         B       δ = 0.49     0.3045   3.4054   3.3917   0.1403   0.3948
                         C       δ = 0.01     0.1430   2.7561   2.7743   0.0837   0.2857

Burr                     Hill    k = 42       0.0069   0.4551   0.4480   0.0049   0.0682
α = 1                    A       δ = 0.49     0.0045   0.5282   0.5268   0.0037   0.0627
m = 0.5                  B       δ = 0.19     0.0086   0.4460   0.4428   0.0057   0.0754
                         C       δ = 0.49    *0.0039   0.5230   0.5219   0.0034   0.0613

Burr                     Hill    k = 42       0.0275   0.9102   0.8960   0.0194   0.1364
α = 1                    A       δ = 0.49     0.0181   1.0581   1.0557   0.0147   0.1252
m = 1                    B       δ = 0.49     0.0233   0.9667   0.9569   0.0222   0.1566
                         C       δ = 0.49    *0.0105   1.0075   1.0061   0.0105   0.1089

Burr                     Hill    k = 42       0.2471   2.7306   2.6881   0.1747   0.4093
α = 1                    A       δ = 0.49     0.1641   3.1779   3.1693   0.1326   0.3745
m = 3                    B       δ = 0.49     2.6905   4.4337   4.3423   0.6357   0.7580
                         C       δ = 0.01    *0.1363   2.7963   2.8095   0.0949   0.3004

Generalized              Hill    k = 82      *0.0031   0.4744   0.4714   0.0025   0.0455
Pareto                   A       δ = 0.40     0.0116   0.5775   0.5756   0.0056   0.0747
m = 0.5                  B       δ = 0.29     0.0068   0.4781   0.4714   0.0063   0.0788
β = 1                    C       δ = 0.45     0.0105   0.5752   0.5743   0.0048   0.0733

Generalized              Hill    k = 42       0.0275   0.9102   0.8960   0.0194   0.1364
Pareto                   A       δ = 0.49     0.0181   1.0581   1.0557   0.0147   0.1252
m = 1                    B       δ = 0.49     0.0233   0.9667   0.9569   0.0222   0.1566
β = 1                    C       δ = 0.49    *0.0105   1.0075   1.0061   0.0105   0.1089

Generalized              Hill    k = 12       0.9708   2.2516   2.1660   0.4111   0.5783
Pareto                   A       δ = 0.01     0.8047   2.1483   2.1436   0.0793   0.2732
m = 3                    B       δ = 0.01    *0.2903   2.7811   2.7509   0.2426   0.4648
β = 1                    C       δ = 0.01     1.0657   1.9898   1.9940   0.0452   0.2103



On Mardia’s Tests of Multinormality

A. Kankainen, S. Taskinen and H. Oja 

Abstract. Classical multivariate analysis is based on the assumption that the
data come from a multivariate normal distribution. Tests of multinormality
have therefore received a great deal of attention. Several tests for assessing
multinormality, among them Mardia’s popular multivariate skewness and
kurtosis statistics, are based on standardized third and fourth moments. In 
Mardia’s construction of the affine invariant test statistics, the data vectors 
are first standardized using the sample mean vector and the sample covariance 
matrix. In this paper we investigate whether, in the test construction, it is ad- 
vantageous to replace the regular sample mean vector and sample covariance 
matrix by their affine equivariant robust competitors. Limiting distributions 
of the standardized third and fourth moments and the resulting test statistics 
are derived under the null hypothesis and are shown to be strongly dependent 
on the choice of the location vector and scatter matrix estimate. Finally, the 
effects of the modification on the limiting and finite-sample efficiencies are il- 
lustrated by simple examples in the case of testing for the bivariate normality. 
In the cases studied, the modifications seem to increase the power of the tests. 

Mathematics Subject Classification (2000). Primary 62H15; Secondary 62F03. 
Keywords. Kurtosis, multinormality, robustness, skewness. 



1. Introduction 

Classical methods of multivariate analysis are for the most part based on the
assumption that the data come from a multivariate normal population. In the
one-sample multinormal case, the regular sample mean vector and sample
covariance matrix are complete and sufficient, but on the other hand highly sensitive
to outlying observations. It is also well known that the tests and estimates based
on the sample mean vector and sample covariance matrix have poor efficiency
properties in the case of heavy-tailed noise distributions. Testing for departures
from multinormality helps to make practical choices between competing methods
of analysis.

In the univariate case, the standardized third and fourth moments $b_1$ and $b_2$ are
often used to indicate skewness and kurtosis. For a random sample $x_1, \ldots, x_n$
from a $p$-variate distribution with sample mean vector $\bar{x}$ and sample covariance
matrix $S$, Mardia (1970, 1974, 1980) defined the $p$-variate skewness and kurtosis
statistics as

$$b_{1,p} = \mathrm{ave}_{i,j}\bigl\{[(x_i - \bar{x})^T S^{-1} (x_j - \bar{x})]^3\bigr\}$$

and

$$b_{2,p} = \mathrm{ave}_{i}\bigl\{[(x_i - \bar{x})^T S^{-1} (x_i - \bar{x})]^2\bigr\},$$

respectively. The statistics $b_{1,p}$ and $b_{2,p}$ are functions of standardized third and
fourth moments, respectively. From the definition one easily sees that $b_{1,p}$ and $b_{2,p}$
are invariant under affine transformations. In the univariate case these reduce to
the usual univariate skewness and kurtosis statistics $b_1$ and $b_2$. Mardia advocated
using the skewness and kurtosis statistics to test for multinormality as they are
distribution-free under multinormality. See also Bera and John (1983) and Koziol
(1993) for the use of the standardized third and fourth moments in the test
construction.
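
To make the two definitions concrete, here is a minimal sketch that computes $b_{1,p}$ and $b_{2,p}$ for a small data set using only the standard library. It assumes the covariance matrix uses the divisor $n$ (the text only says "sample covariance matrix"), and the helper names are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def mardia_statistics(rows):
    """Return (b_{1,p}, b_{2,p}) for a list of p-dimensional observations.

    Assumption: S is the covariance matrix with divisor n.
    """
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    cov = [[sum(c[j] * c[k] for c in centered) / n for k in range(p)]
           for j in range(p)]
    inv = invert(cov)
    # solved[i] = S^{-1} (x_i - mean)
    solved = [[sum(inv[j][k] * c[k] for k in range(p)) for j in range(p)]
              for c in centered]
    b1 = sum(sum(a * b for a, b in zip(ci, sj)) ** 3
             for ci in centered for sj in solved) / n ** 2
    b2 = sum(sum(a * b for a, b in zip(ci, si)) ** 2
             for ci, si in zip(centered, solved)) / n
    return b1, b2
```

Because both statistics depend on the data only through the quadratic forms $(x_i - \bar{x})^T S^{-1} (x_j - \bar{x})$, applying any nonsingular affine map to the rows leaves the output unchanged.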

Intuitively, as in the univariate case, the skewness and kurtosis statistics
$b_{1,p}$ and $b_{2,p}$ compare the variation measured by third and fourth moments to
that measured by second moments. The second, third and fourth moments are
all highly non-robust statistics, however, and more efficient procedures may be
obtained through comparisons of robust and nonrobust measures of variation.
It is therefore an appealing idea to study whether it is useful or reasonable to
replace the regular sample mean vector $\bar{x}$ and sample covariance matrix $S$ by
some affine equivariant robust competitors, a location vector estimate $\hat\mu$ and a
scatter matrix estimate $\hat\Sigma$.

The paper is organized as follows. In Section 2 the limiting distributions of
Mardia’s test statistics under the null model of multinormality are discussed. The
test statistics are based on the standardized third and fourth moments; the regular
mean vector and regular covariance matrix are used in the standardization. Section
3 lists the tools and assumptions for alternative $\sqrt{n}$-consistent affine equivariant
location and scatter estimates. The location and scatter estimates are then used
in the regular manner to standardize the data vectors. The limiting distributions
of the standardized third and fourth moments in this general case are found in
Section 4. As an illustration of the theory, the effects of the method of
standardization on the limiting and small-sample efficiencies in the bivariate case are
studied in Sections 5 and 6. The auxiliary lemmas and proofs have been postponed
to the Appendix.



2. Limiting Null Distributions of Mardia’s Statistics 

In this section we recall the well-known results concerning the limiting null
distributions of Mardia’s skewness and kurtosis statistics. Due to affine invariance,
the distributions of the test statistics do not depend on the unknown $\mu$ and
$\Sigma$. Therefore it is not a restriction to assume in the following derivations that
$x_1, \ldots, x_n$ is a random sample from the $N_p(0, I_p)$ distribution. Write

$$z_i = S^{-1/2} (x_i - \bar{x}), \quad i = 1, \ldots, n,$$

for the data vectors standardized in the classical way. Mardia’s skewness statistic
can be decomposed as

$$b_{1,p} = 6 \sum_{j<k<l} [\mathrm{ave}_i\{z_{ij} z_{ik} z_{il}\}]^2 + 3 \sum_{j \ne k} [\mathrm{ave}_i\{z_{ij}^2 z_{ik}\}]^2 + \sum_{j} [\mathrm{ave}_i\{z_{ij}^3\}]^2$$

and Mardia’s kurtosis statistic is similarly

$$b_{2,p} = \sum_{j} \mathrm{ave}_i\{z_{ij}^4\} + \sum_{j \ne k} \mathrm{ave}_i\{z_{ij}^2 z_{ik}^2\}.$$

See, e.g., Koziol (1993). Under multinormality $b_{1,p}$ and $b_{2,p}$ are affine invariant,
all the standardized third and fourth moments are asymptotically normal and
asymptotically independent, and consequently the limiting distributions of

$$\frac{n\, b_{1,p}}{6} \quad \text{and} \quad \frac{\sqrt{n}\, (b_{2,p} - p(p+2))}{\sqrt{8p(p+2)}}$$

are a $\chi^2$-distribution with $p(p+1)(p+2)/6$ degrees of freedom and a $N(0,1)$
distribution, respectively.
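
The standardisation used in these limiting results can be sketched in a few lines. The function below is only an illustration with a hypothetical name; it returns the skewness statistic $n b_{1,p}/6$, its chi-square degrees of freedom $p(p+1)(p+2)/6$, and the kurtosis statistic $\sqrt{n}(b_{2,p} - p(p+2))/\sqrt{8p(p+2)}$, which would then be compared with chi-square and standard normal quantiles, respectively.

```python
import math


def mardia_test_statistics(b1p, b2p, n, p):
    """Return (skewness statistic, its degrees of freedom, kurtosis statistic)."""
    skew_stat = n * b1p / 6.0
    skew_df = p * (p + 1) * (p + 2) // 6
    kurt_stat = math.sqrt(n) * (b2p - p * (p + 2)) / math.sqrt(8.0 * p * (p + 2))
    return skew_stat, skew_df, kurt_stat
```

In the bivariate case, for instance, the skewness statistic has 4 degrees of freedom and the kurtosis statistic is centred at $b_{2,2} = 8$.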



3. Location and Scatter Estimates 

In Mardia’s test construction, the data vectors are standardized using the regular
sample mean vector $\bar{x}$ and sample covariance matrix $S$. In this paper we consider
what happens if these are replaced by other $\sqrt{n}$-consistent affine equivariant
location and scatter estimates of $\mu$ and $\Sigma$, say, $\hat\mu$ and $\hat\Sigma$. Next we list assumptions
and useful notation for the location vector estimate $\hat\mu$ and scatter matrix estimate $\hat\Sigma$.

First we assume that, for the $N_p(0, I_p)$ distribution, the influence functions
of these location and scatter functionals at $z$ are

$$\gamma(r)\, u \quad \text{and} \quad \alpha(r)\, u u^T - \beta(r)\, I_p,$$

respectively, where $r = \|z\|$ and $u = \|z\|^{-1} z$. We will later see that the limiting
distributions of the modified test statistics depend on the location and scatter
estimates used through these functions. For the influence functions of location and
scatter functionals, see also Hampel et al. (1986), Croux and Haesbroeck (2000)
and Ollila et al. (2003a, 2003b).

We further assume that, again under the $N_p(0, I_p)$ distribution,

$$t = \sqrt{n}\, \mathrm{ave}\{\gamma(r_i) u_i\} = \sqrt{n}\, \hat\mu + o_P(1)$$

and

$$C = \sqrt{n}\, \mathrm{ave}\{\alpha(r_i) u_i u_i^T - \beta(r_i) I_p\} = \sqrt{n}\, (\hat\Sigma - I_p) + o_P(1).$$

Finally, assume that, for the $N_p(0, I_p)$ distribution, the expected values $E(\gamma^2(r))$,
$E(\alpha^2(r))$ and $E(\beta^2(r))$ are finite. These assumptions imply that the limiting
distributions of $\sqrt{n}\, \hat\mu$ and $\sqrt{n}\, \mathrm{vec}(\hat\Sigma - I_p)$ are $p$- and $p^2$-variate normal distributions
with zero mean vectors and covariance matrices

$$\frac{E[\gamma^2(r)]}{p}\, I_p$$

and

$$\frac{E[\alpha^2(r)]}{p(p+2)} \bigl[I_{p^2} + I_{p,p} + \mathrm{vec}(I_p)\mathrm{vec}(I_p)^T\bigr]
+ \Bigl(E[\beta^2(r)] - \frac{2}{p} E[\alpha(r)\beta(r)]\Bigr) \mathrm{vec}(I_p)\mathrm{vec}(I_p)^T,$$

respectively. (Here $I_{p,p}$ is the so-called commutation matrix. See, e.g., Ollila et al.,
2003b.)

4. Limiting Distribution of the Standardized 
Third and Fourth Moments 

Consider now any estimators $\hat\mu$ and $\hat\Sigma$ and write

$$z_i = \hat\Sigma^{-1/2} (x_i - \hat\mu), \quad i = 1, \ldots, n,$$

for the standardized observations. Denote the standardized third and fourth
moments by

$$U_{jkl} = \sqrt{n}\, \mathrm{ave}_i\{z_{ij} z_{ik} z_{il}\},$$

$$V_{jk} = \sqrt{n}\, \mathrm{ave}_i\{z_{ij}^2 z_{ik}^2 - 1\}$$

and

$$V_{jj} = \sqrt{n}\, \mathrm{ave}_i\{z_{ij}^4 - 3\}.$$

The limiting distributions of the standardized third and fourth moments are given
by the following theorems. The results easily follow from Lemmas 7.1 and 7.2 stated
and proven in the Appendix.

Theorem 4.1. Under multinormality, the limiting distribution of the vector of all
possible third moments $U_{jkl}$ ($j < k < l$), $U_{jjk}$ ($j \ne k$) and $U_{jjj}$ is a multivariate
normal distribution with zero mean vector and limiting variances and covariances

$Var(U_{jkl}) = 1$
$Var(U_{jjk}) = 2 + \kappa$
$Var(U_{jjj}) = 6 + 9\kappa$
$Cov(U_{jjk}, U_{llk}) = \kappa$
$Cov(U_{jjk}, U_{kkk}) = 3\kappa$

where

$$\kappa = 1 + \frac{E[\gamma^2(r)]}{p} - \frac{2 E[r^3 \gamma(r)]}{p(p+2)}.$$

All the other covariances are zero.

Theorem 4.2. Under multinormality, the limiting distribution of the vector of
fourth moments $V_{jk}$ ($j \ne k$) and $V_{jj}$ is a multivariate normal distribution with
zero mean vector and limiting variances and covariances

$Var(V_{jk}) = 4 + \tau_1 + 2\tau_2$
$Var(V_{jj}) = 24 + 9\tau_1 + 36\tau_2$
$Cov(V_{jk}, V_{jl}) = \tau_1 + \tau_2$
$Cov(V_{jk}, V_{ml}) = \tau_1$
$Cov(V_{jj}, V_{kk}) = 9\tau_1$
$Cov(V_{jk}, V_{jj}) = 3\tau_1 + 6\tau_2$
$Cov(V_{jk}, V_{ll}) = 3\tau_1$

where

$$\tau_1 = 4 \Bigl[\frac{E[\alpha^2(r)]}{p(p+2)} - \frac{2 E[\alpha(r)\beta(r)]}{p} + E[\beta^2(r)]
- \frac{E[\alpha(r) r^4]}{p(p+2)(p+4)} + \frac{E[\beta(r) r^4]}{p(p+2)}\Bigr]$$

and

$$\tau_2 = 2 + \frac{2 E[\alpha^2(r)]}{p(p+2)} - \frac{4 E[\alpha(r) r^4]}{p(p+2)(p+4)}.$$



We close this section with some discussion on the implications of the above
results. First note that, if the regular mean vector and sample covariance matrix
are used in the standardization, then simply

$\gamma(r) = r$, $\alpha(r) = r^2$ and $\beta(r) = 1$, which gives $\kappa = \tau_1 = \tau_2 = 0$,

and all the third moments and all the fourth moments are, respectively,
asymptotically mutually independent. For other location and scatter statistics
this is not necessarily true and also the limiting variances may vary. This means
that the limiting distributions and efficiencies of $b_{1,p}$ and $b_{2,p}$ may
drastically change. The limiting distribution of $n\,b_{1,p}/6$ is a weighted sum
of chi-square variables with one degree of freedom, which makes the calculation of
the approximate p-value less practical. The limiting distribution of
$\sqrt{n}\,(b_{2,p} - p(p+2))$ is still normal with mean value zero, but its
limiting variance depends on the chosen scatter matrix. In the last two sections
we illustrate the impact of the choice of the location vector and scatter matrix
estimate on the limiting distribution and efficiency in the bivariate case. The
extension to the general p-variate case is straightforward.



5. Tests for Bivariate Normality 

Assume now that $x_1, \ldots, x_n$ is a random sample from an unknown bivariate
distribution and we wish to test the null hypothesis $H_0$ of bivariate normality.
For limiting power considerations we use the contaminated bivariate normal
distribution, the so-called Tukey distribution: We say that the distribution of x
is a bivariate Tukey distribution $T(\Delta, \mu, \sigma)$ if the pdf of x is

$f(x) = (1 - \Delta)\,\phi(x) + \Delta\,\sigma^{-2}\,\phi((x - \mu)/\sigma),$

where $\phi(x)$ is the pdf of the $N_2(0, I_2)$ distribution and $\Delta \in [0, 1]$.
Alternative sequences $H_n : T(\Delta, \mu, 1)$ and $H_n^* : T(\Delta, 0, \sigma)$
with $\Delta = \delta/\sqrt{n}$, $r = \|\mu\| > 0$ and $\sigma > 1$

are then used for the comparisons of the skewness and kurtosis test statistics,
respectively.

Write now

$T_1 = (U_{112}, U_{221}, U_{111}, U_{222})^T$ and $T_2 = (V_{11}, V_{22}, V_{12})^T$

for the vectors of the standardized third and fourth moments. Using the results in
Section 4, one directly gets the following results.

Corollary 5.1. Under $H_0$, the limiting distribution of $\sqrt{n}\,T_1$ is a
4-variate normal distribution with zero mean vector and covariance matrix

$\Sigma_1 = \begin{pmatrix} 2&0&0&0 \\ 0&2&0&0 \\ 0&0&6&0 \\ 0&0&0&6 \end{pmatrix}
 + \kappa \begin{pmatrix} 1&0&0&3 \\ 0&1&3&0 \\ 0&3&9&0 \\ 3&0&0&9 \end{pmatrix}.$

Corollary 5.2. Under $H_0$, the limiting distribution of $\sqrt{n}\,T_2$ is a
3-variate normal distribution with zero mean vector and covariance matrix

$\Sigma_2 = \begin{pmatrix} 24&0&0 \\ 0&24&0 \\ 0&0&4 \end{pmatrix}
 + \tau_1 \begin{pmatrix} 9&9&3 \\ 9&9&3 \\ 3&3&1 \end{pmatrix}
 + \tau_2 \begin{pmatrix} 36&0&6 \\ 0&36&6 \\ 6&6&2 \end{pmatrix}.$

It is now possible to construct the limiting distributions of the affine invariant
test statistics $b_{1,2}$ and $b_{2,2}$. The comparison of the effect of different
standardizations is possible but not easy, however, as the limiting distributions
may be of different type. This is avoided if one uses the following test statistic
versions:

Definition 5.3. Modified Mardia’s skewness and kurtosis statistics in the bivariate
case are

$b^*_{1,2} = T_1^T\,\Sigma_1^{-1}\,T_1$ and $b^*_{2,2} = (1, 1, 2)\,T_2.$

Note that, if the sample mean and covariance matrix are used, the definition
gives $b^*_{1,2} = b_{1,2}/6$ and $b^*_{2,2} = b_{2,2} - 8$, respectively. In the
former case, with general $\hat\mu$ and $\hat\Sigma$, the affine equivariance
property is lost.
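The two identities in the note above can be checked numerically. The following is a minimal sketch, assuming the classical standardization (sample mean and covariance with divisor n) and reading the moment vectors as plain averages without the factor $\sqrt{n}$; the function names are illustrative and not from the paper. A Cholesky factor is used in place of the symmetric square root, which does not affect the two identities being checked.

```python
import math
import random

def ave(values):
    """Arithmetic mean of an iterable."""
    values = list(values)
    return sum(values) / len(values)

def standardize(data):
    """Whiten bivariate data with the sample mean and covariance (divisor n)."""
    m1 = ave(x for x, _ in data)
    m2 = ave(y for _, y in data)
    s11 = ave((x - m1) ** 2 for x, _ in data)
    s22 = ave((y - m2) ** 2 for _, y in data)
    s12 = ave((x - m1) * (y - m2) for x, y in data)
    # Cholesky factor L of the covariance matrix; z = L^{-1}(x - mean).
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return [((x - m1) / l11, ((y - m2) - l21 * (x - m1) / l11) / l22)
            for x, y in data]

def mardia(z):
    """Regular Mardia skewness b_{1,2} and kurtosis b_{2,2}."""
    b1 = ave((a * c + b * d) ** 3 for a, b in z for c, d in z)
    b2 = ave((a * a + b * b) ** 2 for a, b in z)
    return b1, b2

def modified_classical(z):
    """Modified statistics with kappa = tau_1 = tau_2 = 0 (classical case)."""
    u112 = ave(a * a * b for a, b in z)
    u221 = ave(b * b * a for a, b in z)
    u111 = ave(a ** 3 for a, _ in z)
    u222 = ave(b ** 3 for _, b in z)
    v11 = ave(a ** 4 - 3 for a, _ in z)
    v22 = ave(b ** 4 - 3 for _, b in z)
    v12 = ave(a * a * b * b - 1 for a, b in z)
    # T_1' diag(2, 2, 6, 6)^{-1} T_1 and (1, 1, 2) T_2
    b1_star = (u112 ** 2 + u221 ** 2) / 2 + (u111 ** 2 + u222 ** 2) / 6
    b2_star = v11 + v22 + 2 * v12
    return b1_star, b2_star

random.seed(1)
sample = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(50)]
z = standardize(sample)
b1, b2 = mardia(z)
b1_star, b2_star = modified_classical(z)
print(b1 / 6 - b1_star, (b2 - 8) - b2_star)  # both differences are rounding error
```

Both identities are algebraic, so they hold for any data set; the random sample only serves as an arbitrary input.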

The alternative to the regular sample mean vector and sample covariance
matrix used in this example is the affine equivariant Oja median (Oja, 1983) and
the related scatter matrix estimate based on the Oja sign covariance matrix (SCM).
See Visuri et al. (2000) and Ollila et al. (2003b) for the SCM. Note also that the
scatter matrix estimate based on the Oja SCM is asymptotically equivalent to
the zonoid covariance matrix (ZCM) and can be seen as an affine equivariant
extension of the mean deviation. For the definition and use of the ZCM, see
Koshevoy et al. (2003). We chose these statistics as their influence functions are
relatively simple. In the $N_2(0, I_2)$ case, the influence functions are given by

$\gamma(r) = 2\sqrt{2/\pi}$, $\alpha(r) = 4\sqrt{2/\pi}\,r - 2$ and $\beta(r) = 1$.

The constants needed for the asymptotic variances and covariances are then

$\kappa = \frac{4}{\pi} - \frac{1}{2}$, $\tau_1 = \frac{32}{\pi} - \frac{29}{3}$ and $\tau_2 = \frac{16}{\pi} - \frac{14}{3}$.
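The constants $\kappa$, $\tau_1$ and $\tau_2$ are expectations with respect to the distribution of $r = \|z\|$, so they can be checked by one-dimensional numerical integration. The sketch below is only illustrative: it assumes the formulas $\kappa = 1 + E[\gamma^2]/p - 2E[r^3\gamma]/(p(p+2))$, $\tau_1 = 4\{E[\alpha^2]/(p(p+2)) - 2E[\alpha\beta]/p + E[\beta^2] - E[\alpha r^4]/(p(p+2)(p+4)) + E[\beta r^4]/(p(p+2))\}$ and $\tau_2 = 2 + 2E[\alpha^2]/(p(p+2)) - 4E[\alpha r^4]/(p(p+2)(p+4))$, and it assumes the bivariate Oja/ZCM influence functions $\gamma(r) = 2\sqrt{2/\pi}$, $\alpha(r) = 4\sqrt{2/\pi}\,r - 2$, $\beta(r) = 1$. These forms are reconstructions that are consistent with the constants quoted in the text, not a quotation of the original display. The helper names are hypothetical.

```python
import math

def chi_expect(f, p, upper=15.0, steps=6000):
    """E[f(r)] for r = ||z||, z ~ N_p(0, I_p), by composite Simpson's rule."""
    norm = 2 ** (1 - p / 2) / math.gamma(p / 2)  # chi density normalizer

    def g(r):
        return f(r) * norm * r ** (p - 1) * math.exp(-r * r / 2)

    h = upper / steps
    total = g(0.0) + g(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    return total * h / 3

def kappa(gamma, p):
    return (1 + chi_expect(lambda r: gamma(r) ** 2, p) / p
            - 2 * chi_expect(lambda r: r ** 3 * gamma(r), p) / (p * (p + 2)))

def tau1(alpha, beta, p):
    c2, c3 = p * (p + 2), p * (p + 2) * (p + 4)
    return 4 * (chi_expect(lambda r: alpha(r) ** 2, p) / c2
                - 2 * chi_expect(lambda r: alpha(r) * beta(r), p) / p
                + chi_expect(lambda r: beta(r) ** 2, p)
                - chi_expect(lambda r: alpha(r) * r ** 4, p) / c3
                + chi_expect(lambda r: beta(r) * r ** 4, p) / c2)

def tau2(alpha, p):
    c2, c3 = p * (p + 2), p * (p + 2) * (p + 4)
    return (2 + 2 * chi_expect(lambda r: alpha(r) ** 2, p) / c2
            - 4 * chi_expect(lambda r: alpha(r) * r ** 4, p) / c3)

# Classical case (sample mean and covariance): all three constants vanish.
classical = (kappa(lambda r: r, 2),
             tau1(lambda r: r * r, lambda r: 1.0, 2),
             tau2(lambda r: r * r, 2))

# Assumed Oja median / ZCM influence functions in the bivariate case.
c = math.sqrt(2 / math.pi)
oja = (kappa(lambda r: 2 * c, 2),
       tau1(lambda r: 4 * c * r - 2, lambda r: 1.0, 2),
       tau2(lambda r: 4 * c * r - 2, 2))
print(classical, oja)
```

Under these assumptions the second triple agrees with $4/\pi - 1/2$, $32/\pi - 29/3$ and $16/\pi - 14/3$ up to integration error.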

Consider first the alternative contiguous sequence $H_n$ with $\mu = (r, 0)^T$.
Using the classical LeCam’s third lemma one then easily sees that, under the
alternative sequence $H_n$, the limiting distribution of $n\,b^*_{1,2}$ is a
noncentral $\chi^2$-distribution with 4 degrees of freedom and non-centrality
parameter $\delta_1^T\,\Sigma_1^{-1}\,\delta_1$, where $\delta_1$ depends on the
location estimate used in the standardization. In the regular case (sample mean
vector),

$\delta_1 = (0, 0, r^3, 0)^T$

and the Oja median gives

$\delta_1 = (0,\; r - q(r)\,2\sqrt{2/\pi},\; r^3 + 3r - q(r)\,6\sqrt{2/\pi},\; 0)^T.$

The function $q(r) = \|R(\mu)\|$ is easily computable and defined through the
so-called spatial rank function

$R(\mu) = E(\|\mu - x\|^{-1}(\mu - x))$

with $x \sim N_2(0, I_2)$. See Möttönen and Oja (1995).

Next we consider the sequence $H_n^*$. Then

$Var(b^*_{2,2}) = (1, 1, 2)\,\Sigma_2\,(1, 1, 2)^T$

and under $H_n^*$ the limiting distribution of $n\,(b^*_{2,2})^2 / Var(b^*_{2,2})$
is a noncentral $\chi^2$-distribution with one degree of freedom and
non-centrality parameter $[(1, 1, 2)\,\delta_2]^2 / Var(b^*_{2,2})$, where
$\delta_2$ in this case depends on the scatter estimate used in the
standardization. In the regular case (sample covariance matrix) we get

$\delta_2 = (3(\sigma^2 - 1)^2,\; 3(\sigma^2 - 1)^2,\; (\sigma^2 - 1)^2)^T$

and the ZCM gives

$\delta_2 = (3(\sigma^4 - 1) - 12(\sigma - 1),\; 3(\sigma^4 - 1) - 12(\sigma - 1),\; (\sigma^4 - 1) - 4(\sigma - 1))^T.$

In Figure 1 the asymptotic relative efficiencies (AREs) of the new tests (using
the Oja median and the ZCM estimate) relative to the regular tests (using the
sample mean and sample covariance matrix) are illustrated for different values of
r and for different values of $\sigma$. The AREs are simply the ratios of the
non-centrality parameters. The new tests seem to perform very well: the test
statistics using robust standardization seem to perform better than the tests with
regular location and scatter estimates.

Figure 1. Limiting efficiencies of the tests based on the Oja
median and the ZCM relative to the regular Mardia’s tests for
different values of r and $\sigma$. The left (right) panel illustrates the
efficiency of $b^*_{1,2}$ ($b^*_{2,2}$).



6. A Real Data Example 

In our real data example we consider two bivariate data sets of n = 50 observations
and the combined data set (n = 100). The data sets are shown and explained in
Figure 2. We wish to test the null hypothesis of bivariate normality. Again the
skewness and kurtosis statistics to be compared are those based on the sample
mean vector and sample covariance matrix (regular Mardia’s $b_{1,2}$ and
$b_{2,2}$) and those based on the Oja median and the ZCM (denoted by $b^*_{1,2}$
and $b^*_{2,2}$). The values of the test statistics and the approximate p-values
were calculated for the three data sets (‘Boys’, ‘Girls’, ‘Combined’). See Table 1
for the results. As expected, the observed p-values were much smaller in the
combined data case.

As discussed in the introduction, the statistics $b_{1,2}$ and $b_{2,2}$ compare
the third and fourth central moments to the second ones (given by the sample
covariance matrix). The statistics $b^*_{1,2}$ and $b^*_{2,2}$ use robust location
vector and scatter matrix estimates in this comparison. One can then expect that,
for data sets with outliers or for data sets coming from heavy-tailed
distributions, the comparison to robust estimates tends to produce higher values
of the test statistics and smaller p-values.

In our examples for the analysis of bivariate data we used the Oja median and
the ZCM to standardize the data vectors. In the general p-variate case (with large
p and large n) these estimates are, however, computationally highly intensive, and
should be replaced by more practical robust location and scatter estimates.










Figure 2. The height of the mother and the birth weight of the 
child measured on 50 boys who have siblings (o) and 50 girls (x) 
who do not have siblings. 

Table 1. The observed values of the test statistics with corresponding
p-values for the three data sets.

                    Boys                   Girls                 Combined
             Test statistic  p-value  Test statistic  p-value  Test statistic  p-value
$b_{1,2}$        3.034        0.552       3.376        0.497       9.242        0.055
$b^*_{1,2}$      5.152        0.272       2.336        0.674      11.257        0.024
$b_{2,2}$        0.016        0.900       0.282        0.596       2.275        0.132
$b^*_{2,2}$      0.834        0.361       1.152        0.283       5.166        0.023
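The tabulated p-values can be reproduced from the tabulated statistics with closed-form chi-square tail probabilities. The sketch below assumes that the skewness rows are referred to a $\chi^2$ distribution with 4 degrees of freedom and the kurtosis rows to a $\chi^2$ distribution with 1 degree of freedom, in line with the limiting null distributions of Section 5; this reading is an inference from the numbers, and the function names are illustrative.

```python
import math

def chi2_sf_4df(x):
    """P(X > x) for X ~ chi-square with 4 degrees of freedom (closed form)."""
    return math.exp(-x / 2) * (1 + x / 2)

def chi2_sf_1df(x):
    """P(X > x) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# (statistic, tabulated p-value) pairs copied from Table 1
skewness_rows = [(3.034, 0.552), (3.376, 0.497), (9.242, 0.055),
                 (5.152, 0.272), (2.336, 0.674), (11.257, 0.024)]
kurtosis_rows = [(0.016, 0.900), (0.282, 0.596), (2.275, 0.132),
                 (0.834, 0.361), (1.152, 0.283), (5.166, 0.023)]

for stat, p_tab in skewness_rows:
    print(f"{stat:7.3f}  table {p_tab:.3f}  recomputed {chi2_sf_4df(stat):.3f}")
for stat, p_tab in kurtosis_rows:
    print(f"{stat:7.3f}  table {p_tab:.3f}  recomputed {chi2_sf_1df(stat):.3f}")
```

Small discrepancies in the third decimal are expected, because the statistics in the table are themselves rounded.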
7. Appendix: Auxiliary Lemmas

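The limiting distributions of the standardized third and fourth moments are given
by the following linear approximations.

Lemma 7.1. Let $x_1, \ldots, x_n$ be a random sample from the $N(0, I_p)$
distribution and let $z_i = \hat\Sigma^{-1/2}(x_i - \hat\mu)$, $i = 1, \ldots, n$,
be the standardized observations. Write also $r_i = \|x_i\|$ and
$u_i = \|x_i\|^{-1}x_i$. Then

$\sqrt{n}\,\mathrm{ave}_i\{z_{ij}z_{ik}z_{il}\} = \sqrt{n}\,\mathrm{ave}_i\{x_{ij}x_{ik}x_{il}\} + o_P(1)$
$\qquad = \sqrt{n}\,\mathrm{ave}_i\{r_i^3 u_{ij}u_{ik}u_{il}\} + o_P(1)$ for $j < k < l$,

$\sqrt{n}\,\mathrm{ave}_i\{z_{ij}^2 z_{ik}\} = \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^2 x_{ik}\} - t_k + o_P(1)$
$\qquad = \sqrt{n}\,\mathrm{ave}_i\{r_i^3 u_{ij}^2 u_{ik} - \gamma(r_i)u_{ik}\} + o_P(1)$ for $j \ne k$,

and

$\sqrt{n}\,\mathrm{ave}_i\{z_{ij}^3\} = \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^3\} - 3t_j + o_P(1)$
$\qquad = \sqrt{n}\,\mathrm{ave}_i\{r_i^3 u_{ij}^3 - 3\gamma(r_i)u_{ij}\} + o_P(1).$

Lemma 7.2. Let $x_1, \ldots, x_n$ be a random sample from the $N(0, I_p)$
distribution and let $z_i = \hat\Sigma^{-1/2}(x_i - \hat\mu)$, $i = 1, \ldots, n$,
be the standardized observations. Again, $r_i = \|x_i\|$ and
$u_i = \|x_i\|^{-1}x_i$. Then for $j \ne k$,

$\sqrt{n}\,\mathrm{ave}_i\{z_{ij}^2 z_{ik}^2 - 1\} = \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^2 x_{ik}^2 - 1\} - C_{jj} - C_{kk} + o_P(1)$
$\qquad = \sqrt{n}\,\mathrm{ave}_i\{r_i^4 u_{ij}^2 u_{ik}^2 - 1 - \alpha(r_i)[u_{ij}^2 + u_{ik}^2] + 2\beta(r_i)\} + o_P(1)$

and

$\sqrt{n}\,\mathrm{ave}_i\{z_{ij}^4 - 3\} = \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^4 - 3\} - 6C_{jj} + o_P(1)$
$\qquad = \sqrt{n}\,\mathrm{ave}_i\{r_i^4 u_{ij}^4 - 3 - 6\alpha(r_i)u_{ij}^2 + 6\beta(r_i)\} + o_P(1).$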

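Proofs of the lemmas: Note first that

$z_i = \hat\Sigma^{-1/2}(x_i - \hat\mu) = x_i - \frac{1}{\sqrt{n}}\,t - \frac{1}{2\sqrt{n}}\,C x_i + o_P(1/\sqrt{n}).$

Thus

$z_{ij} = x_{ij} - \frac{1}{\sqrt{n}}\,t_j - \frac{1}{2\sqrt{n}}\sum_{r=1}^{p} C_{jr}x_{ir} + o_P(1/\sqrt{n}),$

$z_{ij}^2 = x_{ij}^2 - \frac{2}{\sqrt{n}}\,t_j x_{ij} - \frac{1}{\sqrt{n}}\sum_{r=1}^{p} C_{jr}x_{ir}x_{ij} + o_P(1/\sqrt{n}),$

$z_{ij}^3 = x_{ij}^3 - \frac{3}{\sqrt{n}}\,t_j x_{ij}^2 - \frac{3}{2\sqrt{n}}\sum_{r=1}^{p} C_{jr}x_{ir}x_{ij}^2 + o_P(1/\sqrt{n})$

and

$z_{ij}^4 = x_{ij}^4 - \frac{4}{\sqrt{n}}\,t_j x_{ij}^3 - \frac{2}{\sqrt{n}}\sum_{r=1}^{p} C_{jr}x_{ir}x_{ij}^3 + o_P(1/\sqrt{n}).$

As the observations come from the $N_p(0, I_p)$ distribution (all the moments
exist), the standard law of large numbers and central limit theorem together with
Slutsky’s lemma yield the result.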
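As a worked illustration of how the constants in the lemmas arise, the following sketch traces the term $-6C_{jj}$ in the second part of Lemma 7.2. It assumes the first-order expansion $z_{ij} = x_{ij} - n^{-1/2}t_j - (2\sqrt{n})^{-1}\sum_r C_{jr}x_{ir} + o_P(n^{-1/2})$ and uses only the standard normal moments $E[x_{ij}^3] = 0$ and $E[x_{ir}x_{ij}^3] = 3\delta_{rj}$, together with the fact that t and C are bounded in probability.

```latex
\begin{align*}
\sqrt{n}\,\mathrm{ave}_i\{z_{ij}^4 - 3\}
 &= \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^4 - 3\}
    - 4\,t_j\,\mathrm{ave}_i\{x_{ij}^3\}
    - 2\sum_{r=1}^{p} C_{jr}\,\mathrm{ave}_i\{x_{ir}x_{ij}^3\} + o_P(1) \\
 &= \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^4 - 3\}
    - 4\,t_j\cdot o_P(1) - 2\sum_{r=1}^{p} C_{jr}\,\bigl(3\delta_{rj} + o_P(1)\bigr) + o_P(1) \\
 &= \sqrt{n}\,\mathrm{ave}_i\{x_{ij}^4 - 3\} - 6\,C_{jj} + o_P(1).
\end{align*}
```

The other displays follow in the same way from the corresponding mixed moments of the standard normal distribution.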
Robustness in Sequential Discrimination
of Markov Chains under “Contamination”

A. Kharin 



Abstract. The problem of robustness is considered for sequential hypotheses 
testing on the parameters of Markov chains. The exact expressions for the con- 
ditional error probabilities, and for the conditional expected sequence lengths 
are obtained. Robustness analysis under “contamination” is performed. Nu- 
merical results are given to illustrate the theory. 

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62L10, 
62M02. 
1. Introduction

The sequential approach (Wald, 1947) is used in applications, especially in medical
trials (Whitehead, 1997) and in statistical quality control, for testing
hypotheses on parameters of models (Siegmund, 1985). One of the applied problems of
sequential testing is the problem of discrimination of Markov chains (Durbin et
al., 1998; Wu et al., 2001).

Testing of simple hypotheses on parameters of Markov chains is realized
by the sequential probability ratio test, which minimizes the expected sequence
length. Such a model, however, is usually not adequate for the applied problem:
some allowance for deviations from the model is needed. Sequential testing of
composite hypotheses leads to a significant increase in the expected sequence
length, even for asymptotically optimal tests (Malyutov and Tsitovich, 2000).

An alternative way to work with the deviations from the model of simple
hypotheses is to analyze robustness (Huber, 1981; Stockinger and Dutter, 1987) of
the test in the case of simple hypotheses under distortions of the model (Kharin,
2002). In this paper we analyze robustness of the sequential hypotheses testing
under “contamination” (Huber, 1981).
2. Mathematical Model and Notation

Let a homogeneous Markov chain $x_1, x_2, \ldots$ be observed,
$x_i \in V = \{0, 1, \ldots, M - 1\}$, $i \in N$, with an initial states
probabilities vector $\pi = (\pi_i)$, $i \in V$, and with a one-step transition
probabilities matrix $P = (p_{ij})$, $i, j \in V$: $P\{x_1 = i\} = \pi_i$,
$P\{x_n = j \mid x_{n-1} = i\} = p_{ij}$, $i, j \in V$, $n > 1$. Let us consider
two hypotheses on the parameters of the Markov chain: $H_0$: $\pi = \pi^{(0)}$,
$P = P^{(0)}$, with an alternative $H_1$: $\pi = \pi^{(1)}$, $P = P^{(1)}$, where
$\pi^{(0)}$, $\pi^{(1)}$ are given values of the initial states probabilities
vector, and $P^{(0)} \ne P^{(1)}$ are given matrices of transition probabilities
for the corresponding hypotheses. Introduce the notation:

$\lambda_1 = \ln\frac{\pi^{(1)}_{x_1}}{\pi^{(0)}_{x_1}}$,
$\lambda_k = \ln\frac{p^{(1)}_{x_{k-1}x_k}}{p^{(0)}_{x_{k-1}x_k}}$, $k > 1$;
$\Lambda_n = \sum_{k=1}^{n} \lambda_k$, $n \in N$.

Let us describe the sequential test for the hypotheses $H_0$, $H_1$. Suppose
$C_-, C_+ \in Z$, $C_- < 0$, $C_+ > 0$ are the parameters of the test
(thresholds); $N = C_+ - C_- - 1$. The hypothesis $H_0$ is accepted after n
observations if $\Lambda_n \le C_-$, the hypothesis $H_1$ is accepted after n
observations if $\Lambda_n \ge C_+$; otherwise the test is continued, which means
that the (n + 1)-th observation should be made. The sequence of random vectors
$(\Lambda_n, x_n)'$, $n \in N$, is a Markov chain (Kemeni and Snell, 1959): by the
definition of $\Lambda_n$, $n \in N$, and because $x_n$, $n \in N$, is a Markov
chain,

$P\{\Lambda_n, x_n \mid \Lambda_{n-1}, \Lambda_{n-2}, \ldots, \Lambda_1; x_{n-1}, x_{n-2}, \ldots, x_1\} = P\{\Lambda_n, x_n \mid \Lambda_{n-1}, x_{n-1}\}.$

Let the parameters $P^{(0)}$, $\pi^{(0)}$, $P^{(1)}$, $\pi^{(1)}$ satisfy the
condition: there exist $a > 0$ and $m_i, m_{ij} \in Z$, $i, j \in V$, such that

$\ln\frac{\pi^{(1)}_i}{\pi^{(0)}_i} = m_i a$, $\ln\frac{p^{(1)}_{ij}}{p^{(0)}_{ij}} = m_{ij} a.$

In practice, these equations can be approximated with the necessary accuracy
by choosing the value of a small enough. Let $t^{(k)}$ be the conditional expected
sequence length under the condition that the hypothesis $H_k$ is true,
$k \in \{0, 1\}$; let $\alpha$, $\beta$ be the conditional error probabilities
(error probabilities of the type I and II respectively).
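The decision rule of this section can be sketched in a few lines. The code below is illustrative only: it works on the integer scale $\Lambda_n / a$ (so that the increments are the integers $m_i$, $m_{ij}$ and the thresholds are the integers $C_-$, $C_+$), it uses non-strict inequalities at the thresholds, and the two-state chains are hypothetical values chosen so that all likelihood ratios are powers of 2. None of these numerical values come from the paper.

```python
import math

def integer_increments(pi0, pi1, P0, P1, a):
    """Integers m_i, m_ij with ln(pi1_i/pi0_i) = m_i*a and ln(p1_ij/p0_ij) = m_ij*a."""
    states = range(len(pi0))
    m_init = [round(math.log(pi1[i] / pi0[i]) / a) for i in states]
    m_trans = [[round(math.log(P1[i][j] / P0[i][j]) / a) for j in states]
               for i in states]
    return m_init, m_trans

def sequential_test(xs, m_init, m_trans, c_minus, c_plus):
    """Return (accepted hypothesis index, number of observations used).

    The first component is None if neither threshold is reached on the path xs.
    """
    lam = 0
    prev = None
    for n, x in enumerate(xs, start=1):
        lam += m_init[x] if prev is None else m_trans[prev][x]
        if lam <= c_minus:
            return 0, n
        if lam >= c_plus:
            return 1, n
        prev = x
    return None, len(xs)

# Hypothetical two-state example with a = ln 2.
pi0 = pi1 = [0.5, 0.5]
P0 = [[2 / 3, 1 / 3], [1 / 3, 2 / 3]]
P1 = [[1 / 3, 2 / 3], [2 / 3, 1 / 3]]
m_init, m_trans = integer_increments(pi0, pi1, P0, P1, math.log(2))

print(sequential_test([0, 1, 0, 1, 0], m_init, m_trans, -3, 3))  # frequent switching
print(sequential_test([0, 0, 0, 0, 0], m_init, m_trans, -3, 3))  # no switching
```

In this example every switch of state adds 1 to the statistic and every repetition subtracts 1, so a path that keeps switching favors $H_1$ and a constant path favors $H_0$.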



3. Exact Expressions for Conditional Error Probabilities and 
Expected Sequence Lengths 

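Let us denote by

$W^{(k)} = (w^{(k)}_{ij}) = \begin{pmatrix} I_2 & 0_{2 \times MN} \\ R^{(k)} & Q^{(k)} \end{pmatrix}$

the matrix of the size $(MN + 2) \times (MN + 2)$, where the blocks $R^{(k)}$,
$Q^{(k)}$ are defined by their components:

$q^{(k)}_{Mi+s,\,Mj+t} = \delta(j - i, m_{st})\,p^{(k)}_{st}$, $i, j \in (C_-, C_+)$, $s, t \in V$,

$r^{(k)}_{Mi+s,\,j} = \begin{cases}
 \sum_{t \in V} 1(i + m_{st} - C_+)\,p^{(k)}_{st}, & j = C_+,\ i \in (C_-, C_+),\ s \in V, \\
 \sum_{t \in V} 1(C_- - i - m_{st})\,p^{(k)}_{st}, & j = C_-,\ i \in (C_-, C_+),\ s \in V,
 \end{cases}$    (3.1)

where $\delta(\cdot, \cdot)$ means the Kronecker delta function: $\delta(k, l) = 1$
if $k = l$, and $\delta(k, l) = 0$ otherwise, $k, l \in N$; $1(\cdot)$ means the
unit step function: $1(u) = 1$ if $u \ge 0$, and $1(u) = 0$ otherwise. Define the
vector $\omega^{(k)} = (\omega^{(k)}_i)$, $i \in \{MC_- + 1, \ldots, MC_+ - 1\}$,
by its components:

$\omega^{(k)}_{Mi+s} = \delta(i, m_s)\,\pi^{(k)}_s$, $i \in (C_-, C_+)$, $s \in V$,    (3.2)

and define

$\omega^{(k)}_{MC_+} = \sum_{s \in V} 1(m_s - C_+)\,\pi^{(k)}_s$, $\omega^{(k)}_{MC_-} = \sum_{s \in V} 1(C_- - m_s)\,\pi^{(k)}_s.$    (3.3)

Define also the matrices $S^{(k)} = I_{MN} - Q^{(k)}$, $k \in \{0, 1\}$.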

Theorem 3.1. For the model described above 

a = P = ; 

$t^{(k)} = (\omega^{(k)})'\,(S^{(k)})^{-1}\,1_{MN} + 1$, $k \in \{0, 1\}$.

Proof. It is based on the theory of absorbing Markov chains (Kemeni and Snell,
1959). This theory is applied to the sequence

$\xi_n = MC_-\,1_{(-\infty, C_-]}(\Lambda_n) + MC_+\,1_{[C_+, +\infty)}(\Lambda_n)
 + (M\Lambda_n + x_n)\,1_{(C_-, C_+)}(\Lambda_n)$, $n \in N$,    (3.4)

which is a Markov chain with $MN + 2$ states. Two of these states, $MC_-$ and
$MC_+$, are the absorbing states. The transition probabilities matrix is given by
(3.1). The vector of initial states probabilities is given by (3.2), (3.3). Note
that $|S^{(k)}| \ne 0$, $k \in \{0, 1\}$, because $(S^{(k)})^{-1}$ is the
fundamental matrix (Kemeni and Snell, 1959) of the Markov chain (3.4). □



4. Robustness Analysis 

Let us analyze the robustness of the sequential test under a “contamination”
(Huber, 1981) in observations. Let the hypothetical model described above be
distorted: the initial states probabilities vector and the transition
probabilities matrix are given by

$\bar\pi^{(k)} = (1 - \varepsilon)\,\pi^{(k)} + \varepsilon\,\tilde\pi^{(k)}$,
$\bar P^{(k)} = (1 - \varepsilon)\,P^{(k)} + \varepsilon\,\tilde P^{(k)}$, $k \in \{0, 1\}$,    (4.1)

where $\tilde\pi^{(k)} \ne \pi^{(k)}$ and $\tilde P^{(k)} \ne P^{(k)}$ are the
initial states probabilities vector and the transition probabilities matrix for
the “contaminating” Markov chain, and $\varepsilon \in [0, \frac{1}{2})$ is the
probability of “contamination” presence (Huber, 1981).

Define $\tilde R^{(k)}$, $\tilde Q^{(k)}$ and $\tilde\omega^{(k)}$,
$k \in \{0, 1\}$, similarly to the hypothetical case, replacing $p^{(k)}_{ij}$
and $\pi^{(k)}_i$ with $\tilde p^{(k)}_{ij}$ and $\tilde\pi^{(k)}_i$ respectively.
Define the matrix $\tilde S^{(k)} = I_{MN} - \tilde Q^{(k)}$, $k \in \{0, 1\}$.

Theorem 4.1. If the hypothetical model is distorted according to (4.1), then the
conditional error probabilities $\bar\alpha$, $\bar\beta$, and the conditional
expected sequence lengths $\bar t^{(k)}$ differ from the corresponding
hypothetical characteristics by the values of the order $O(\varepsilon)$:

x(5W)-'lMiv + 0 (a, 

a-a^ e^(a;(o))' (( 5 ^)"^ ((qW - (5(0))“^ 

+ - w(0))' 

(4.3) 

0-13 = (( 5 ( 1 ))“^ 

+ - w^|-_) + 0{e'^). 

(4.4) 

Proof. It is based on the fact that under the distortions (4.1) the blocks of the
transition probabilities matrix and the initial states probabilities vector for
the sequence (3.4) have the forms

$\bar Q^{(k)} = Q^{(k)} + \varepsilon\,(\tilde Q^{(k)} - Q^{(k)})$,
$\bar R^{(k)} = R^{(k)} + \varepsilon\,(\tilde R^{(k)} - R^{(k)})$,
$\bar\omega^{(k)} = \omega^{(k)} + \varepsilon\,(\tilde\omega^{(k)} - \omega^{(k)}).$

□

Theorem 4.1 allows one to approximate linearly the characteristics of the test
under “contamination” using the result of Theorem 3.1. In Theorem 4.1 the
coefficients of $\varepsilon$ in (4.2)-(4.4) are bounded for all possible values
of $\tilde\pi^{(k)}$ and $\tilde P^{(k)}$; this leads to the fact that the linear
term of the influence function (Hampel et al., 1986) is bounded.
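To make the linearization concrete, here is a sketch of the first-order expansion for the expected sequence length. The notation is an assumption made for this illustration: bars denote the distorted model (4.1), tildes denote the “contaminating” chain, $S^{(k)} = I_{MN} - Q^{(k)}$, and $t^{(k)} = (\omega^{(k)})'(S^{(k)})^{-1}1_{MN} + 1$ as in Theorem 3.1. The expansion is offered as a derivation consistent with the statement of Theorem 4.1, not as a quotation of (4.2).

```latex
\begin{align*}
\bar S^{(k)} &= I_{MN} - \bar Q^{(k)}
   = S^{(k)} - \varepsilon\,\bigl(\tilde Q^{(k)} - Q^{(k)}\bigr), \\
(\bar S^{(k)})^{-1} &= (S^{(k)})^{-1}
   + \varepsilon\,(S^{(k)})^{-1}\bigl(\tilde Q^{(k)} - Q^{(k)}\bigr)(S^{(k)})^{-1}
   + O(\varepsilon^2), \\
\bar t^{(k)} - t^{(k)} &= \varepsilon\,\Bigl(\bigl(\tilde\omega^{(k)} - \omega^{(k)}\bigr)'
   + (\omega^{(k)})'(S^{(k)})^{-1}\bigl(\tilde Q^{(k)} - Q^{(k)}\bigr)\Bigr)
   (S^{(k)})^{-1}\,1_{MN} + O(\varepsilon^2).
\end{align*}
```

The second line is the usual Neumann-series expansion of the inverse of a perturbed matrix; the third follows by multiplying out $(\bar\omega^{(k)})'(\bar S^{(k)})^{-1}1_{MN}$ and keeping the terms linear in $\varepsilon$.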



5. Simulation Results 

To illustrate the accuracy of the linear approximations (4.2)-(4.4) under the
“contamination” (4.1) for the conditional error probabilities and for the
conditional expected sequence lengths, we performed a simulation study.

The experiments were done for the case M = 5, with the following
values of the parameters of the hypothetical and “contaminating” Markov chains:
1 1 1 1 1 \ 

f ! ! ! n 

f f f ! ! 

f ! ! f f ’ 

f f ! ! ? 

4 4 4 8 8 / 

i i i i i \ 

f f ! ! ! 1 

f f f f ! 

f f f ! ! ’ 

f f ! f ! 

8 8 4 4 4 / 

i i i J_ 

! ! ! ^ ¥ 

i ! ! ¥ 

^ i ! ! 

¥ ¥ ! ! ! 

16 16 8 4 2 

JL J_ i i i \ 

¥ ¥ f ? f ] 

¥ ¥ ! ! i 

¥ f ? i ¥ • 

1 ! ! ¥ ¥ 

2 4 8 16 16 / 

In this example, a = ln 2, and the values of $m_i$, $m_{ij}$,
$i, j \in \{0, 1, \ldots, 4\}$, can be calculated easily according to Section 2.

Let $\alpha_0$, $\beta_0$ be the requested values of the error probabilities of
the type I and II respectively, and let $\hat\alpha(\varepsilon)$,
$\hat\beta(\varepsilon)$ be the estimates of the error probabilities of the type I
and II, which are obtained by Monte Carlo simulation under “contamination” with
the probability of “contamination” presence $\varepsilon$: each estimate is the
ratio of the number of decisions in favor of $H_k$ to the total number $N_{H_l}$
of realizations under $H_l$ ($k, l \in \{0, 1\}$, $k \ne l$). Let
$\tilde\alpha(\varepsilon)$, $\tilde\beta(\varepsilon)$ be the linear
approximations of the conditional error probabilities $\alpha$, $\beta$, computed
according to (4.3), (4.4); let $\hat t^{(0)}(\varepsilon)$,
$\hat t^{(1)}(\varepsilon)$ be the estimates of the conditional expected sequence
lengths, which are obtained by statistical simulation, and let
$\tilde t^{(0)}(\varepsilon)$, $\tilde t^{(1)}(\varepsilon)$ be the linear
approximations of these characteristics computed according to (4.2).

The thresholds $C_+$, $C_-$ were computed by the formulae (Wald, 1947):

$C_+ = \left[\frac{1}{a}\ln\frac{1 - \beta_0}{\alpha_0}\right]$,
$C_- = \left[\frac{1}{a}\ln\frac{\beta_0}{1 - \alpha_0}\right]$,

where [·] means the integer part. The number of experiments with each parameter
set was equal to $N_{H_0} = N_{H_1} = 10000$. The results are presented in Table 1.
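For orientation, the threshold formulae can be evaluated for the four parameter sets used in the experiments. This is a minimal sketch; it assumes that the integer part of a negative number is obtained by truncation toward zero, which the text does not specify, and the function name is illustrative.

```python
import math

def wald_thresholds(alpha0, beta0, a):
    """Integer thresholds (C_minus, C_plus) on the scale of multiples of a.

    Assumption: "integer part" is taken as truncation toward zero.
    """
    c_plus = math.trunc(math.log((1 - beta0) / alpha0) / a)
    c_minus = math.trunc(math.log(beta0 / (1 - alpha0)) / a)
    return c_minus, c_plus

a = math.log(2)
for alpha0, beta0 in [(0.01, 0.01), (0.01, 0.1), (0.05, 0.1), (0.1, 0.1)]:
    print(alpha0, beta0, wald_thresholds(alpha0, beta0, a))
```

If the integer part were instead read as the floor function, the lower thresholds would each decrease by one.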
Table 1. Results of the experiments.

  ε      α̂(ε)    α̃(ε)    β̂(ε)    β̃(ε)    t̂(0)(ε)  t̃(0)(ε)  t̂(1)(ε)  t̃(1)(ε)

                          α0 = 0.01, β0 = 0.01
 0.01   0.0146  0.0178  0.0090  0.0090   34.085   34.733   29.282   29.780
 0.05   0.0248  0.0268  0.0167  0.0143   36.663   38.512   31.861   33.260
 0.2    0.1055  0.0607  0.0744  0.0340   50.010   52.684   43.951   46.310

                          α0 = 0.01, β0 = 0.1
 0.01   0.0139  0.0167  0.0621  0.0675   19.372   19.616   26.642   26.847
 0.05   0.0236  0.0250  0.0909  0.0910   21.098   21.736   28.210   28.911
 0.2    0.0890  0.0559  0.2110  0.1790   27.428   29.686   33.653   36.547

                          α0 = 0.05, β0 = 0.1
 0.01   0.0631  0.0642  0.0587  0.0642   17.385   17.796   17.186   17.558
 0.05   0.081   0.0856  0.0792  0.0856   18.742   19.091   18.160   18.821
 0.2    0.1830  0.1660  0.1870  0.1660   21.847   23.947   21.271   23.558

                          α0 = 0.1, β0 = 0.1
 0.01   0.1140  0.1260  0.0589  0.0599   15.636   15.882   12.763   12.971
 0.05   0.1520  0.1570  0.0815  0.0793   16.391   16.612   13.557   13.880
 0.2    0.2680  0.2750  0.1681  0.1518   18.220   19.349   15.838   17.291

On the basis of the results of the experiments one can conclude that the accuracy
of the linear approximation w.r.t. ε is satisfactory under “contamination” of the
hypothetical model.
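One simple way to read Table 1 is to check which columns are exactly linear in ε. The sketch below uses the first block of the table (requested error probabilities 0.01 and 0.01) and compares the slopes between consecutive values of ε. It assumes that in each pair of columns the first is the Monte Carlo estimate and the second is the linear approximation; the variable names are illustrative.

```python
eps = [0.01, 0.05, 0.2]

# First block of Table 1 (alpha_0 = 0.01, beta_0 = 0.01)
columns = {
    "alpha_mc":  [0.0146, 0.0248, 0.1055],
    "alpha_lin": [0.0178, 0.0268, 0.0607],
    "beta_mc":   [0.0090, 0.0167, 0.0744],
    "beta_lin":  [0.0090, 0.0143, 0.0340],
    "t0_mc":     [34.085, 36.663, 50.010],
    "t0_lin":    [34.733, 38.512, 52.684],
    "t1_mc":     [29.282, 31.861, 43.951],
    "t1_lin":    [29.780, 33.260, 46.310],
}

def slopes(xs, ys):
    """Slopes of the segments joining consecutive points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

def is_linear(xs, ys, rel_tol=0.02):
    """True if all segment slopes agree up to a relative tolerance."""
    s = slopes(xs, ys)
    return max(s) - min(s) <= rel_tol * max(abs(v) for v in s)

for name, values in columns.items():
    print(name, [round(v, 4) for v in slopes(eps, values)], is_linear(eps, values))
```

The second column of each pair has an essentially constant slope, as expected from a linear approximation, while the simulated values bend upward for ε = 0.2.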

Our previous results for i.i.d. random variables (Kharin, 2002) agree with
the results presented in this paper, as a special case.

For values of ε that are large enough, the conditional error probabilities
$\hat\alpha(\varepsilon)$, $\tilde\alpha(\varepsilon)$, $\hat\beta(\varepsilon)$,
$\tilde\beta(\varepsilon)$ differ significantly from the targeted values
$\alpha_0$ and $\beta_0$ (see Table 1, the case of ε = 0.2). This means that a
robust sequential test should be used in this case.

Note in conclusion that, by Theorems 3.1 and 4.1 and by the approach proposed
in Kharin (2002), the minimax robust sequential test can be constructed.
Robust Box-Cox Transformations
for Simple Regression

A. Marazzi and V.J. Yohai 

Abstract. The use of the Box-Cox family of transformations is a popular
approach to make data behave according to a linear regression model. The
regression coefficients, as well as the parameter λ defining the transformation,
are generally estimated by maximum likelihood, assuming homoscedastic normal
errors. These estimates are nonrobust; in addition, consistency to the true
parameters holds only if the assumptions of normality and homoscedasticity
are satisfied. We present here a new class of estimates, for the case of simple
regression, which are robust and consistent even if the assumptions of
normality and homoscedasticity do not hold.

Mathematics Subject Classification (2000). Primary 62J05; Secondary 62G35. 
Keywords. Box-Cox transformations, robust estimation, heteroscedasticity. 

1. Introduction 

The Box-Cox family of transformations has become a widely used tool to make
data behave according to a linear regression model. Sakia (1992) has given an
excellent review of the work relating to this transformation. The response
variable, transformed according to the Box-Cox procedure, is usually assumed to be
linearly related to its covariates and the errors normally distributed with
constant variance. The regression coefficients, as well as the parameter λ
defining the transformation, are generally estimated by maximum likelihood (ML).
Unfortunately, near normality and homoscedasticity are hard to attain
simultaneously with a single transformation. In addition, the ML-estimate is not
consistent under non-normal or heteroscedastic errors and it is not robust.

Carroll and Ruppert (1988) proposed bounded influence estimates based on
the normal homoscedastic model to limit the influence of a moderate number of
outliers. Within their approach, heteroscedasticity was modelled with the help
of a weighting system. Moreover, various semiparametric and nonparametric
approaches to relax the parametric structure of the response distribution have
been studied. References concerning these proposals can be found in Foster et al.
(2001), where a semiparametric procedure based on a minimum distance criterion
between a semiparametric and a nonparametric estimate of the error distribution is
developed. In general, these approaches do not provide effective protection
against heavy contamination and heteroscedasticity.

This work was completed with the support of Grant 2053-066895.01 from the Swiss National
Science Foundation, Grant PICT99 0306277 from the Agencia Nacional de Promocion de la
Ciencia y la Tecnologia, Grant X611 from the Universidad de Buenos Aires, Argentina, and a
grant from Fundacion Antorchas, Argentina.

In this paper we present a new class of estimates for the case of simple
regression which are robust and consistent even if the assumptions of normality
and homoscedasticity do not hold. The new estimates are based on minimization
(with respect to λ) of a measure of autocorrelation among the residuals with
respect to a robust estimate of the regression coefficients (for given λ). In order
to compute the autocorrelation, the residuals have to be ordered according to
the values of the regressor. We use the residual autocorrelation as a measure of
functional relationship between the residuals and the regressor. This measure is
reminiscent of a proposal by Maravall (1983) to detect nonlinearity in a time series.



2. The Model 



We consider a bivariate random sample $(x_1, y_1), \ldots, (x_n, y_n)$ and assume
that the response and the explanatory variables are linked by the linear
relationship

$y_i^{(\lambda_0)} = \alpha_0 + \beta_0 x_i + h(x_i)\,u_i,$    (2.1)

where $\alpha_0$, $\beta_0$ and $\lambda_0$ are real parameters and $y_i > 0$. The
function $h(\cdot)$ is unknown and

$y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda & \text{if } \lambda \ne 0, \\ \ln(y) & \text{if } \lambda = 0, \end{cases}$    (2.2)

denotes the usual Box-Cox transformation. We assume that the errors $u_i$ are
i.i.d. according to a common cdf F, that $u_i$ is independent of $x_i$, and
$E(u_i) = 0$.



3. The Robust Autocorrelation Estimate 

Suppose that $(\alpha_n, \beta_n)$ are robust estimates of intercept and slope for
the simple linear regression model, e.g., MM-estimates (Yohai, 1987). Let
$(\alpha_n(\lambda), \beta_n(\lambda))$ be the result of applying these estimates
to the responses $y_i^{(\lambda)}$ for a given λ and $s_n(\lambda)$ a robust
measure of the residual scale, e.g., the one used to compute the MM-estimate
(usually, the initial high breakdown point S-estimate of scale; Rousseeuw and
Leroy, 1987). If the residuals are computed using the true parameter $\lambda_0$,
their conditional mean is close to zero for all values of x. On the other hand,
when the residuals are computed using a $\lambda \ne \lambda_0$, there is a
functional relationship between the residual conditional mean and x. A suitable
value of λ should therefore minimize a measure of dependency between residuals and
x. One such measure is the robust residual autocorrelation $\rho_n(\lambda)$. Its
definition depends on whether there are tied x-values or not.

3.1. The Case Without Tied x-values 

Firstly, we suppose that all the values $x_1, \ldots, x_n$ are distinct. In this
case $\rho_n(\lambda)$ is given by the following steps:

1. Sort $x_i$, $i = 1, \ldots, n$, in ascending order and let $i_1, \ldots, i_n$
be the corresponding permuted indices. Thus, $x_{i_1} < x_{i_2} < \cdots < x_{i_n}$.

2. For a given λ, let $r_j(\lambda) = y_{i_j}^{(\lambda)} - \alpha_n(\lambda) - x_{i_j}\beta_n(\lambda)$
($j = 1, \ldots, n$) and $v_j(\lambda) = \psi(r_j(\lambda)/s_n(\lambda))$.

3. Set

$\rho_n(\lambda) = \frac{1}{n-1}\sum_{j=1}^{n-1} v_j(\lambda)\,v_{j+1}(\lambda),$    (3.1)

where $\psi(\cdot)$ is any monotone, odd, and bounded function, e.g., the
well-known Huber’s function.
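The three steps can be sketched as follows. This is a simplified illustration rather than the robust procedure itself: least squares estimates and the usual residual standard deviation stand in for the MM-estimates and the S-scale (this corresponds to the non-robust "AC" variant mentioned in Section 5), the normalization by $n - 1$ and the scaled score $v_j = \psi(r_j/s_n)$ are assumptions, and the Huber tuning constant 1.345 is an arbitrary illustrative choice.

```python
import math

def box_cox(y, lam):
    """Box-Cox transformation of a positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def huber_psi(u, c=1.345):
    """Huber's psi function: identity on [-c, c], constant outside."""
    return max(-c, min(c, u))

def least_squares(x, z):
    """Intercept and slope of the least squares line of z on x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    slope = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
             / sum((a - mx) ** 2 for a in x))
    return mz - slope * mx, slope

def residual_autocorrelation(x, y, lam):
    """Lag-one autocorrelation of psi-transformed residuals ordered by x (no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])          # step 1
    xs = [x[i] for i in order]
    zs = [box_cox(y[i], lam) for i in order]
    a, b = least_squares(xs, zs)
    r = [zi - a - b * xi for xi, zi in zip(xs, zs)]            # step 2
    n = len(r)
    s = math.sqrt(sum(ri * ri for ri in r) / (n - 2))
    v = [huber_psi(ri / s) for ri in r]
    return sum(v[j] * v[j + 1] for j in range(n - 1)) / (n - 1)  # step 3

# Noise-free responses that are exactly linear in x on the scale lambda_0 = 0.5:
# 2 * (sqrt(y) - 1) = 10 + 2 * x, i.e. y = (6 + x)^2.
x = [0.2 * i for i in range(1, 101)]
y = [(6 + xi) ** 2 for xi in x]
print(residual_autocorrelation(x, y, 1.0))  # wrong lambda: smooth residual pattern
```

With the wrong value λ = 1 the residuals follow a smooth curve in x, so neighbouring residuals have the same sign and the autocorrelation is clearly positive; an estimate of λ would be obtained by minimizing this quantity over a grid of λ values.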

3.2. The Case With Tied x-values 

We now assume that there are only k distinct x-values and that $n_1$ observations
are equal to the smallest, $n_2$ to the next smallest, ..., $n_k$ to the largest.
Thus, $n_1 + n_2 + \cdots + n_k = n$. In this case, we modify the terms in (3.1) as
follows:

1. Replace the first $n_1 - 1$ terms by

$(n_1 - 1)\binom{n_1}{2}^{-1}\sum_{i=1}^{n_1 - 1}\sum_{j=i+1}^{n_1} v_i(\lambda)\,v_j(\lambda).$

2. Replace the term $v_{n_1}(\lambda)\,v_{n_1+1}(\lambda)$ by

$\frac{1}{n_1 n_2}\sum_{i=n_1+1}^{n_1+n_2}\sum_{j=1}^{n_1} v_i(\lambda)\,v_j(\lambda).$

3. Replace the following $n_2 - 1$ terms by

$(n_2 - 1)\binom{n_2}{2}^{-1}\sum_{i=n_1+1}^{n_1+n_2-1}\sum_{j=i+1}^{n_1+n_2} v_i(\lambda)\,v_j(\lambda)$

and so on. Note that this replacement procedure is equivalent to arbitrarily
permuting the tied observations and computing their correlations by averaging over
the permutations.

The robust autocorrelation is positive when there is dependency between
residuals and the regressor x; otherwise it is close to zero. However, the equation
$\rho_n(\lambda) = 0$ may have more than one solution or none. Therefore, we define
the robust autocorrelation estimate of $\lambda_0$ as the global minimum of
$\rho_n(\lambda)$. Another proposal for robust autocorrelation can be found in Ma
and Genton (2001).
Remark 3.1. We tried to extend the robust autocorrelation estimate to the case
of several covariates. For that purpose the residuals were sorted according to the
fitted values (in place of the x-values). Unfortunately, simulation results showed
that the performance of this proposal is not satisfactory. This poor performance
may be due to the fact that the fitted value corresponding to a value
$\lambda \ne \lambda_0$ may be almost uncorrelated with the fitted value associated
with $\lambda = \lambda_0$.

4. Consistency 

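For simplicity, we consider the model given by (2.1) and $y^{(\lambda)} = y^{\lambda}$.
If $\lambda \ne 0$, this model is equivalent to the one given by (2.1) and (2.2).
We prove consistency of the robust autocorrelation estimate supposing that
$\lambda \in \Lambda = [l_1, l_2]$ with $l_1 > 0$ and $l_2 < \infty$. We make the
following assumptions.

(A1) In the model (2.1), $(u_1, x_1), (u_2, x_2), \ldots, (u_n, x_n)$ are i.i.d.
and $u_i$ and $x_i$ are independent. The distribution F of $u_i$ is continuous,
symmetric, and $E(u_i) = 0$. The distribution G of $x_i$ is continuous.

(A2) h is continuous.

(A3) $\psi$ is continuous, odd, monotone, and bounded.

(A4) (Uniform consistency of the estimates $\alpha_n$, $\beta_n$, and $s_n$). For
all $\lambda \in \Lambda$, there exist real numbers $\varepsilon_1(\lambda) > 0$,
$s_0 > 0$ and continuous functions $\alpha(\lambda)$, $\beta(\lambda)$,
$s(\lambda)$ such that $\alpha(\lambda_0) = \alpha_0$, $\beta(\lambda_0) = \beta_0$,
$s(\lambda) \ge s_0 > 0$, and

$\mathrm{p\,lim}_{n \to \infty}\ \sup_{|\lambda^* - \lambda| \le \varepsilon_1(\lambda)} |\alpha_n(\lambda^*) - \alpha(\lambda^*)| = 0,$

$\mathrm{p\,lim}_{n \to \infty}\ \sup_{|\lambda^* - \lambda| \le \varepsilon_1(\lambda)} |\beta_n(\lambda^*) - \beta(\lambda^*)| = 0,$

$\mathrm{p\,lim}_{n \to \infty}\ \sup_{|\lambda^* - \lambda| \le \varepsilon_1(\lambda)} |s_n(\lambda^*) - s(\lambda^*)| = 0,$

where p lim denotes convergence in probability.

(A5) (Robust identifiability condition). Let

$r(\lambda, \alpha, \beta, u, x) = y^{\lambda} - \alpha - \beta x = (\alpha_0 + \beta_0 x + h(x)u)^{\lambda/\lambda_0} - \alpha - \beta x,$

$d(\lambda, \alpha, \beta, s, x) = E(\psi(r(\lambda, \alpha, \beta, u, x)/s) \mid x),$

where x and y denote generic values of the covariate and the response. Then,
for any $\lambda \ne \lambda_0$, $\alpha$, $\beta$ and $s > 0$, there exists an
interval $I(\lambda, \alpha, \beta, s)$ such that
$P(x \in I(\lambda, \alpha, \beta, s)) > 0$ and for all
$x \in I(\lambda, \alpha, \beta, s)$

$d(\lambda, \alpha(\lambda), \beta(\lambda), s, x) \ne 0.$

We consider the estimates:

$\hat\lambda_n = \arg\min_{\lambda \in \Lambda} \rho_n(\lambda)$, $\hat\alpha_n = \alpha_n(\hat\lambda_n)$, $\hat\beta_n = \beta_n(\hat\lambda_n).$

Theorem 4.1. Under (A1)-(A5), $\hat\lambda_n \to \lambda_0$,
$\hat\alpha_n \to \alpha_0$, and $\hat\beta_n \to \beta_0$ in probability.

The proof is given in the appendix.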
Remark 4.2. Suppose that (A5) does not hold. Then, there exist values
$\lambda_1 \ne \lambda_0$, $\alpha_1$, $\beta_1$ and s such that, with probability
one, $y^{\lambda_1} = \alpha_1 + \beta_1 x + u$, where u satisfies the condition
$E(\psi_s(u) \mid x) = 0$ with $\psi_s(u) = \psi(u/s)$. But, according to (A1), the
model (2.1) satisfies the same condition and, therefore, it is not identifiable.
When $h(x) = 1$ and $\beta_0 = 0$, this condition does not hold; in this case
$\lambda_0$ in model (2.1) is clearly not identifiable.

Remark 4.3. We have not yet verified that the MM-estimates and the scale
estimates we are suggesting for $\alpha_n$, $\beta_n$, and $s_n$ satisfy the
uniform consistency condition (A4). However, this assumption seems very plausible.
The methods of Berrendero and Zamar (2003) could be used to prove this property.



5. Empirical Results 

Bivariate observations $(x_i, y_i)$ were generated according to the model

$$y_i^{\lambda_0} = \alpha_0 + \beta_0 x_i + h(x_i)u_i, \qquad i = 1, \ldots, n,$$

where the $u_i$ were i.i.d., such that $u_i = b e_i$, $E(e_i) = 0$, the distribution of $e_i$
was uniform $U(-0.5, 0.5)$, normal $N(0, 1)$, Student $t_6$, $t_3$, contaminated normal
$0.9N(0, 1) + 0.1N(0, 25)$, or exponential, and the factor $b$ was chosen so that
MAD$(u_i) = 1/3$. In addition, $\alpha_0 = 10$, $\beta_0 = 2$, $\lambda_0 = 0.5$, $n = 100$, $x_i = 0.2 \cdot i$
($i = 1, \ldots, n$). Two options for $h(x)$ were used: $h(x) = 3$ (homoscedastic case) and
$h(x) = x/2$ (heteroscedastic case). Each experiment was based on 1000 samples.
The parameters $\alpha_0$, $\beta_0$ and the function $h$ were chosen in order to have positive
responses, clear identifiability, and perceptible homo/heteroscedasticity. The
choice of $\lambda_0$ was not relevant, because of the equivariance of $\lambda_n$
($\lambda_n(y_1^c, \ldots, y_n^c) = \lambda_n(y_1, \ldots, y_n)/c$). Various parameter sets provided similar results.
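
The data-generating design described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code (they report using FORTRAN and S-plus). It assumes that MAD means the unnormalized median absolute deviation about the median, and that the factor $b$ is chosen from the sample MAD of the drawn errors. The names `mad` and `simulate_sample` are hypothetical.

```python
import random
import statistics


def mad(values):
    """Unnormalized median absolute deviation about the median (assumed definition)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def simulate_sample(n=100, alpha0=10.0, beta0=2.0, lambda0=0.5,
                    h=lambda x: 3.0, draw_error=None, rng=None):
    """Draw (x_i, y_i) with y_i^lambda0 = alpha0 + beta0*x_i + h(x_i)*u_i."""
    rng = rng or random.Random(0)
    draw_error = draw_error or (lambda r: r.gauss(0.0, 1.0))  # normal errors by default
    x = [0.2 * i for i in range(1, n + 1)]
    e = [draw_error(rng) for _ in range(n)]
    b = (1.0 / 3.0) / mad(e)          # assumption: rescale so that the sample MAD of u is 1/3
    u = [b * ei for ei in e]
    # Invert the power transformation to obtain the response on its original scale.
    y = [(alpha0 + beta0 * xi + h(xi) * ui) ** (1.0 / lambda0)
         for xi, ui in zip(x, u)]
    return x, y, u


# Homoscedastic case h(x) = 3; pass h=lambda x: x / 2 for the heteroscedastic case.
x, y, u = simulate_sample()
```

Other error laws fit the same pattern by passing a different `draw_error` callable that returns centered draws.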

The results are reported in Table 1 and Table 2 and include the average
bias of the simulated estimates (bias $= 1000 \times \mathrm{mean}(\lambda_n - \lambda_0)$), the root of the mean
squared error ($\sqrt{\mathrm{mse}} = 1000 \times (\mathrm{mean}(\lambda_n - \lambda_0)^2)^{1/2}$) and the median absolute
deviation with respect to $\lambda_0$ (mde $= 1000 \times \mathrm{median}(|\lambda_n - \lambda_0|)/0.6745$). RAC denotes
the robust autocorrelation estimator defined in Section 3, where $\alpha_n(\lambda)$, $\beta_n(\lambda)$ are
MM-estimators (with breakdown point 0.5 and efficiency 0.95 for normal errors, as
defined in Yohai, 1987) and $s_n(\lambda)$ is the robust measure of the residual scale provided
by an initial S-estimate (Rousseeuw and Leroy, 1987, p. 135, with tuning constant
$c = 1.547$). AC denotes the autocorrelation estimator when $\alpha_n(\lambda)$, $\beta_n(\lambda)$, and
$s_n(\lambda)$ are least squares estimates. F is the estimator defined in Foster et al. (2001)
and RF is a modification that uses the MM-estimates $\alpha_n(\lambda)$, $\beta_n(\lambda)$ mentioned
above in place of least squares estimates. BI denotes the bounded influence
estimator as defined in Carroll and Ruppert (1988, p. 186, with tuning constant $a = 1.5$
and based on 10 iterations). ML denotes the maximum likelihood estimator.
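
The three summary measures can be computed from a list of simulated estimates as in the following sketch. It assumes that mde uses the absolute deviations from $\lambda_0$, as the phrase "median absolute deviation" suggests. The function name `summarize` is hypothetical.

```python
import statistics


def summarize(estimates, lambda0):
    """Return (bias, root mse, mde), each scaled by 1000 as in the tables."""
    errors = [est - lambda0 for est in estimates]
    bias = 1000 * statistics.fmean(errors)
    root_mse = 1000 * statistics.fmean([e * e for e in errors]) ** 0.5
    # 0.6745 is the usual consistency factor of the MAD at the normal model.
    mde = 1000 * statistics.median([abs(e) for e in errors]) / 0.6745
    return bias, root_mse, mde


bias, root_mse, mde = summarize([0.50, 0.52, 0.48], lambda0=0.5)
```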
Table 1. Simulation results: homoscedastic case.

Errors  Measure    RAC     AC      F     RF     BI     ML
Unif    bias         0      0      0      0     -2      0
        √mse        22     20     22     21     23     18
        mde         22     21     22     21     23     18
Gauss   bias        -1     -1      0      0     -3      0
        √mse        28     26     27     26     26     23
        mde         28     26     28     27     24     23
t6      bias         0      0      1      1     -4      0
        √mse        29     30     29     28     26     29
        mde         30     31     30     29     27     29
t3      bias        -2     -1     -1     -1     -6      1
        √mse        31     46     30     29     27     54
        mde         31     36     28     28     27     45
CntG    bias         0     -2     -1     -1     -4      0
        √mse        30     48     31     30     27     54
        mde         29     46     31     30     26     56
Exp     bias        -2     -1     -1     -1      2    -26
        √mse        28     38     24     22     26     44
        mde         27     36     23     22     27     44


In the cases of homoscedastic short-tailed error distributions, the performance
of the various estimators was similar. For long-tailed distributions, BI, RAC and
RF provided effective and similar protection (RF being slightly superior for
exponential errors). In the heteroscedastic symmetric cases, only the autocorrelation
estimates RAC and AC were not biased and the robustness of RAC was very
satisfactory. Unfortunately, all estimates (with the exception of AC) were strongly biased
in the asymmetric heteroscedastic case.

Remark 5.1. We used FORTRAN programs loaded into S-plus to evaluate the
initial estimates $\alpha_n(\lambda)$, $\beta_n(\lambda)$, and $s_n(\lambda)$ on a grid of 101 equally spaced points
between 0.05 and 1.25; linear interpolation was used for other values of $\lambda$. On a
733 MHz Pentium III, the computing time (seconds) to obtain the initial estimates
on the grid was 0.22 ($n = 20$), 1.60 ($n = 50$), 10.77 ($n = 100$), 78.14 ($n = 200$).
The computing time for minimizing $\rho_n(\lambda)$ was 0.05 ($n = 20$), 0.08 ($n = 50$), 0.15
($n = 100$), 0.27 ($n = 200$).



6. Appendix 

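For any $\lambda \in \Lambda$, $u$, $u^*$, $x$, and $\varepsilon > 0$ we define

$$t(\lambda, u, u^*, x, \varepsilon) = \inf_{D(\lambda,\varepsilon,x)} \psi(r(\lambda',\alpha',\beta',u,x)/s')\,\psi(r(\lambda',\alpha',\beta',u^*,x')/s'),$$
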
Table 2. Simulation results: heteroscedastic case.

Errors  Measure    RAC     AC      F     RF     BI     ML
Unif    bias         1      2    -34    -34   -120   -136
        √mse        45     40     62     54    124    139
        mde         44     40     64     53    179    205
Gauss   bias         2      3    -33    -32   -142   -192
        √mse        46     50     62     54    146    195
        mde         46     49     61     52    207    285
t6      bias         5      6    -30    -30   -149   -222
        √mse        49     59     63     55    153    226
        mde         50     58     61     54    221    332
t3      bias         1      6    -36    -34   -158   -234
        √mse        47     97     66     56    162    252
        mde         44     71     68     56    233    365
CntG    bias         3      8    -36    -33   -146   -216
        √mse        48    122     68     58    150    242
        mde         45     93     67     58    216    339
Exp     bias        22      6    -56    -54   -171   -313
        √mse        70     80     87     74    179    319
        mde         71     72     94     82    255    467


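where $D(\lambda, \varepsilon, x)$ is the set of all $\lambda'$, $\alpha'$, $\beta'$, $s'$, $x'$ that satisfy the following conditions:
$|\lambda' - \lambda| < \varepsilon$, $|\alpha' - \alpha(\lambda')| < \varepsilon$, $|\beta' - \beta(\lambda')| < \varepsilon$, $|s' - s(\lambda')| < \varepsilon$, $|x' - x| < \varepsilon$.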

Lemma 6.1. The following properties hold: 

(i) $d(\lambda, \alpha, \beta, s, x)$ is continuous;

(ii) $d(\lambda_0, \alpha_0, \beta_0, s, x) = E(\psi(h(x)u/s) \mid x) = 0$.

Proof. (i) follows from the Dominated Convergence Theorem and (ii) is immediate.

Lemma 6.2. Let $u$, $u^*$, $x$ be independent variables such that $u$ and $u^*$ have
distribution $F$ and $x$ distribution $G$. Then, for a given $\lambda \ne \lambda_0$, there exists $\varepsilon_2(\lambda) < \varepsilon_1(\lambda)$,
such that $E(t(\lambda, u, u^*, x, \varepsilon_2(\lambda))) > 0$.

Proof. By (A5),

$$P\big(E(\psi(r(\lambda,\alpha(\lambda),\beta(\lambda),u,x)/s(\lambda)) \mid x) \ne 0\big) > 0$$

and then

$$E(t(\lambda, u, u^*, x, 0)) = E\big(E^2(\psi(r(\lambda,\alpha(\lambda),\beta(\lambda),u,x)/s(\lambda)) \mid x)\big) > 0.$$

The Lemma follows from the continuity of $t(\lambda, u, u^*, x, \varepsilon)$ and the Dominated
Convergence Theorem.
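Lemma 6.3. Let $x_1, x_2, \ldots$ be a random sample of a distribution $G$, and let
$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be the ordered sample. Define $d_i = x_{(i+1)} - x_{(i)}$,
$i = 1, \ldots, n-1$, and $m_n(\varepsilon) = \#\{i : d_i > \varepsilon\}/n$. Then, for any $\varepsilon > 0$, $m_n(\varepsilon) \to 0$ a.s.
as $n \to \infty$.

Proof. Take $\delta > 0$ and let $M$ be such that $P(|x| > M) < \delta$. Then

$$m_n(\varepsilon) = \frac{\#\{i : d_i > \varepsilon, |x_{(i)}| > M\}}{n} + \frac{\#\{i : d_i > \varepsilon, |x_{(i)}| \le M\}}{n} \le \frac{\#\{i : |x_i| > M\}}{n} + \frac{1}{n}\,\frac{2M}{\varepsilon}.$$

Therefore,

$$\limsup_{n\to\infty} m_n(\varepsilon) \le \lim_{n\to\infty} \frac{\#\{i : |x_i| > M\}}{n} \le \delta \quad \text{a.s.}$$

Since this inequality holds for any $\delta > 0$, the Lemma follows.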
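
A small simulation illustrates Lemma 6.3: the fraction of spacings between consecutive order statistics that exceed a fixed $\varepsilon$ shrinks as $n$ grows. The sketch below assumes standard normal covariates and $\varepsilon = 0.05$; both choices, and the function name `large_gap_fraction`, are illustrative only.

```python
import random


def large_gap_fraction(sample, eps):
    """m_n(eps): number of spacings d_i = x_(i+1) - x_(i) exceeding eps, divided by n."""
    xs = sorted(sample)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return sum(g > eps for g in gaps) / len(xs)


rng = random.Random(1)
for n in (50, 500, 5000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, large_gap_fraction(sample, eps=0.05))
```

For large $n$ only the sparse tails of the sample contribute spacings larger than $\varepsilon$, which is the mechanism used in the proof.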



Proof of Theorem 4.1

To prove that $\operatorname{p\,lim}_{n\to\infty} \lambda_n = \lambda_0$, it will be enough to show that:

(a) If $\delta > 0$ and $L = \{|\lambda - \lambda_0| \ge \delta\}$, then there exists $z > 0$, such that

$$\operatorname*{p\,lim\,inf}_{n\to\infty} \inf_{\lambda \in L} \rho_n(\lambda) > z;$$

(b) $\operatorname{p\,lim} \rho_n(\lambda_0) = 0$.

We start by proving (a). Since we are assuming that the distribution $G$ of $x$ is
continuous we can assume that all the $x_i$'s are different. We can write

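$$\rho_n(\lambda) = \frac{1}{n} \sum_{i=1}^{n-1} \psi\!\left(\frac{r(\lambda,\alpha_n(\lambda),\beta_n(\lambda),u_{(i)},x_{(i)})}{s_n(\lambda)}\right) \psi\!\left(\frac{r(\lambda,\alpha_n(\lambda),\beta_n(\lambda),u_{(i+1)},x_{(i+1)})}{s_n(\lambda)}\right),$$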

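where $x_{(i)} = x_{j_i}$, $u_{(i)} = u_{j_i}$ and $j_1, j_2, \ldots, j_n$ are the indices such that
$x_{j_1} < x_{j_2} < \cdots < x_{j_n}$. For any $\lambda \ne \lambda_0$, let $\varepsilon_2(\lambda)$ be defined as in Lemma 6.2. According
to the Heine-Borel theorem, we can find $\lambda_1, \ldots, \lambda_k$ in $L$, such that $L \subset \bigcup_{j=1}^{k} L_j$,
where $L_j = \{\lambda : |\lambda - \lambda_j| < \varepsilon_2(\lambda_j)\}$. It is therefore enough to show that there exist
numbers $z_j > 0$, $j = 1, \ldots, k$, such that

$$\operatorname*{p\,lim\,inf}_{n\to\infty} \inf_{\lambda \in L_j} \rho_n(\lambda) > z_j, \qquad j = 1, \ldots, k.$$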

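Let $I = \{i : x_{(i+1)} - x_{(i)} > \varepsilon_2(\lambda_j)\}$ and $m_n = \#I/n$. In addition, for a given $j$ let

$$H_{n1} = \Big\{ \sup_{|\lambda-\lambda_j|<\varepsilon_1(\lambda_j)} |\alpha_n(\lambda) - \alpha(\lambda)| < \varepsilon_2(\lambda_j) \Big\},$$

$$H_{n2} = \Big\{ \sup_{|\lambda-\lambda_j|<\varepsilon_1(\lambda_j)} |\beta_n(\lambda) - \beta(\lambda)| < \varepsilon_2(\lambda_j) \Big\},$$

$$H_{n3} = \Big\{ \sup_{|\lambda-\lambda_j|<\varepsilon_1(\lambda_j)} |s_n(\lambda) - s(\lambda)| < \varepsilon_2(\lambda_j) \Big\},$$

and $H_n = H_{n1} \cap H_{n2} \cap H_{n3}$. Then, by (A4),

$$\lim_{n\to\infty} P(H_n) = 1. \qquad (6.1)$$

In $H_n$, we have:

$$\begin{aligned}
\inf_{\lambda \in L_j} \rho_n(\lambda)
&\ge \frac{1}{n} \sum_{i \notin I} t(\lambda_j, u_{(i)}, u_{(i+1)}, x_{(i)}, \varepsilon_2(\lambda_j)) \\
&\quad + \inf_{\lambda \in L_j} \frac{1}{n} \sum_{i \in I} \psi\!\left(\frac{r(\lambda,\alpha_n(\lambda),\beta_n(\lambda),u_{(i)},x_{(i)})}{s_n(\lambda)}\right) \psi\!\left(\frac{r(\lambda,\alpha_n(\lambda),\beta_n(\lambda),u_{(i+1)},x_{(i+1)})}{s_n(\lambda)}\right) \\
&\ge \frac{1}{n} \sum_{i \notin I} t(\lambda_j, u_{(i)}, u_{(i+1)}, x_{(i)}, \varepsilon_2(\lambda_j)) - m_n K^2 \\
&\ge \frac{1}{n} \sum_{i=1}^{n-1} t(\lambda_j, u_{(i)}, u_{(i+1)}, x_{(i)}, \varepsilon_2(\lambda_j)) - 2 m_n K^2,
\end{aligned}$$

where $K = \sup \psi$. Therefore, by Lemma 6.3 and (6.1),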

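$$\begin{aligned}
\operatorname*{p\,lim\,inf}_{n\to\infty} \inf_{\lambda \in L_j} \rho_n(\lambda)
&\ge \operatorname*{p\,lim}_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} t(\lambda_j, u_{(i)}, u_{(i+1)}, x_{(i)}, \varepsilon_2(\lambda_j)) \\
&= \operatorname*{p\,lim}_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} t(\lambda_j, u_i, u_{j_{r_i+1}}, x_i, \varepsilon_2(\lambda_j)) \qquad (6.2)
\end{aligned}$$

where $(r_1, \ldots, r_n)$ is the inverse permutation of $(j_1, \ldots, j_n)$. Since this permutation
depends only on the $x_i$'s, but not on the $u_i$'s, we have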



E 



^ ri ,— 1 

^ j ^2(Ai)) 

i=l 



E{t{\j,u,u*,x,e2{Xj)), 



(6.3) 



where u, u* and x are independent random variables, the first two with distribution 
F and the third with distribution G. In addition, 



Var - E t{Xj , Ui^ 'iJ'n+i ) ^2{Xj)) 

\ n . ^ 



i=l 



n — 1 



[Var(t(Aj, n, u*, x, 62 (Aj)) 



+ Cov{t{Xj,u, u*, X, e 2 iXj))t{Xj,u*,u**,x*,S 2 iXj)))] 
where u,u* ,u** ,x,x* are independent. Then 



lim Var 

n—^oo 



1 n 1 \ 

— t{Xj^Ui^ Xi, £2(Aj)) 1=0. 






(6.4) 
Therefore, by (6.2), (6.3), (6.4), Chebychev's inequality, and Lemma 6.2, we get

$$\operatorname*{p\,lim}_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n-1} t(\lambda_j, u_i, u_{j_{r_i+1}}, x_i, \varepsilon_2(\lambda_j)) = E(t(\lambda_j, u, u^*, x, \varepsilon_2(\lambda_j))) > 0,$$

which proves (a).

The proof of part (b) is similar to the proof of part (a). The main difference is that
we now use Lemma 6.1 (ii) instead of Lemma 6.2. This proves $\operatorname{p\,lim}_{n\to\infty} \lambda_n = \lambda_0$.
Using (A4) we get $\operatorname{p\,lim}_{n\to\infty} \alpha_n = \alpha_0$ and $\operatorname{p\,lim}_{n\to\infty} \beta_n = \beta_0$.
Consistency of the Least Weighted
Squares Regression Estimator

L. Masicek 



Abstract. The paper deals with the least weighted squares estimator which is 
robust and generalizes classical least trimmed squares. We provide conditions 
under which this estimator is consistent and we prove consistency for general 
regression. 

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62J05. 
Keywords. Robust regression, the least trimmed squares, the least weighted 
squares, weak consistency. 



1. Introduction 

Let us consider the following regression model

$$Y_i = X_i^T \beta_0 + Z_i \quad \text{for } i = 1, \ldots, n \qquad (1.1)$$

where $X_i = (X_{i1}, \ldots, X_{ip})^T$ is the $p \times 1$ column vector of random explanatory
variables, $\beta_0$ is the $p \times 1$ column vector of unknown regression coefficients and $Z_i$
are the random fluctuations with $EZ_i = 0$. Moreover, the sequence of random
vectors $X_1, \ldots, X_n$ is independent and identically distributed (IID), the sequence of
random variables $Z_1, \ldots, Z_n$ is IID and both sequences are mutually independent.
For any $\beta \in R^p$ denote the $i$th residual as

$$r_i(\beta) := Y_i - X_i^T \beta = Z_i - X_i^T(\beta - \beta_0) \qquad (1.2)$$

and the $h$th order statistic of the squared residuals by $r^2_{(h)}(\beta)$, i.e.,

$$0 \le r^2_{(1)}(\beta) \le r^2_{(2)}(\beta) \le \cdots \le r^2_{(n)}(\beta). \qquad (1.3)$$

Denote the order statistics of the absolute value of the residuals by $r_{|h|}(\beta) := \sqrt{r^2_{(h)}(\beta)}$
(i.e., also the square root of the $h$th order statistic of the squared residuals).
Now we can define the least weighted squares estimator (LWS) as

$$\hat\beta_n := \arg\min_{\beta \in R^p} \sum_{i=1}^{n} w\!\left(\frac{i-1}{n}\right) r^2_{(i)}(\beta) \qquad (1.4)$$

where $w : [0, 1] \to R$ is a given weight function. Typically we suppose that $w$ is
nonincreasing (i.e., observations with larger residuals have smaller weight).
Without loss of generality we suppose $w(1) = 0$.
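
As an illustration, the objective minimized by the LWS estimator can be evaluated as in the following sketch, which assumes the form $\sum_{i=1}^{n} w((i-1)/n)\, r^2_{(i)}(\beta)$. The function names `lws_objective` and `lts_weight` are hypothetical, and only the objective is shown, not a minimization algorithm.

```python
def lws_objective(beta, X, y, w):
    """Sum of w((i-1)/n) * r_(i)^2(beta) over the ordered squared residuals."""
    n = len(y)
    squared = sorted(
        (yi - sum(b * xij for b, xij in zip(beta, xi))) ** 2
        for xi, yi in zip(X, y)
    )
    # enumerate starts at 0, so position i carries weight w(i/n) = w((rank - 1)/n).
    return sum(w(i / n) * r2 for i, r2 in enumerate(squared))


def lts_weight(a):
    """Indicator weight w(alpha) = I{alpha < a}, the least trimmed squares choice."""
    return lambda alpha: 1.0 if alpha < a else 0.0


# Intercept-only toy data with one gross outlier.
X = [[1.0], [1.0], [1.0], [1.0]]
y = [0.0, 1.0, 2.0, 100.0]
trimmed = lws_objective([1.0], X, y, lts_weight(0.75))    # outlier receives weight 0
untrimmed = lws_objective([1.0], X, y, lambda alpha: 1.0)  # ordinary sum of squares
```

With the indicator weight the largest squared residual is ignored, while a constant weight reproduces the ordinary sum of squares.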

The least weighted squares estimator was developed by Visek (2000) and it
generalizes classical least trimmed squares (e.g., Rousseeuw, 1984; Rousseeuw and
Leroy, 1987), which we get for the choice $w(\alpha) = I\{\alpha < a\}$ where $I\{\ldots\}$ is an
indicator function and $a \in (0, 1)$. The main reason for developing this estimator was
to improve applicability. In the least trimmed squares estimator we can adjust only
one constant, but in the LWS estimator we can choose the whole weight function.
This gives us a chance to increase efficiency or decrease gross error sensitivity. The
case of a location parameter was discussed by Masicek (2002) and $n^{1/2}$-consistency
was proved.

The LWS estimator has several nice properties. First of all, the breakdown
point can be computed immediately from the weight function. If $w(\alpha) > 0$ for
$\alpha < a$ and $w(\alpha) = 0$ for $\alpha > a$ then the breakdown point of the LWS estimator
equals $\min\{1 - a, a\}$. This means that the breakdown point is under control and
we can choose it arbitrarily up to 0.5. Finally, the LWS estimator is regression and
scale equivariant.

In the next section we rewrite the function inside the minimization (1.4) in 
the form of a statistical functional (i.e., as a function of the empirical distribution 
function). In Section 3 we provide a criterion for consistency of the LWS estimator 
and the main theorem is formulated. Section 4 provides a detailed proof. 



2. LWS Estimator as a Statistical Functional 

Let us denote the function to be minimized in (1.4) as

$$\mathrm{MF}_n(\beta) := \sum_{h=1}^{n} w\!\left(\frac{h-1}{n}\right) r^2_{(h)}(\beta). \qquad (2.1)$$

For $x \in R^p$ and $z \in R$ denote by $F(x, z)$ the distribution function (d.f.) of the
$(p + 1)$-dimensional vector $(X_i, Z_i) = (X_{i1}, \ldots, X_{ip}, Z_i)$ and denote by $F_n(x, z)$ the
corresponding empirical distribution function (e.d.f.) based on random vectors
$(X_1, Z_1), \ldots, (X_n, Z_n)$. We suppose distribution functions to be left continuous.

Now we rewrite $\mathrm{MF}_n(\beta)$ and the LWS estimator into the form of a statistical
functional, i.e., as a function of the e.d.f. $F_n$ as follows. The residuals in (2.1) can
be expressed as $r^2_{(h)}(\beta) = \sum_{i=1}^{n} r_i^2(\beta)\, I\{|r_i(\beta)| = r_{|h|}(\beta)\}$. Hence we can reorder
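the summation in (2.1):

$$\frac{1}{n} \mathrm{MF}_n(\beta) = \sum_{h=1}^{n} w_{h,n} \left[ \frac{1}{n} \sum_{i=1}^{n} (Z_i - X_i^T t)^2\, I\{|Z_i - X_i^T t| \le r_{|h|}(\beta_0 + t)\} \right], \qquad (2.2)$$

where $t := \beta - \beta_0$ and $w_{h,n} := w\!\left(\frac{h-1}{n}\right) - w\!\left(\frac{h}{n}\right)$ (recall that $w(1) = 0$).

Since we are working with continuous random errors we need not take into account
the case of $r_{|h|}(\beta) = r_{|l|}(\beta)$ for $h \ne l$.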
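
The reordering of the sum is a summation by parts, and it can be checked numerically. The sketch below compares $n$ times both sides of (2.2) for a small vector of residuals, assuming the forms $\sum_h w((h-1)/n)\, r^2_{(h)}$ and $w_{h,n} = w((h-1)/n) - w(h/n)$, with an illustrative linear weight $w(\alpha) = 1 - \alpha$ (so that $w(1) = 0$). All names are hypothetical.

```python
def direct_sum(residuals, w):
    """Left-hand side: sum over h of w((h-1)/n) * r_(h)^2."""
    n = len(residuals)
    squared = sorted(r * r for r in residuals)
    return sum(w(h / n) * squared[h] for h in range(n))


def reordered_sum(residuals, w):
    """Right-hand side: sum over h of w_{h,n} times the sum of r_i^2 with |r_i| <= r_|h|."""
    n = len(residuals)
    abs_sorted = sorted(abs(r) for r in residuals)
    total = 0.0
    for h in range(1, n + 1):
        w_hn = w((h - 1) / n) - w(h / n)
        inner = sum(r * r for r in residuals if abs(r) <= abs_sorted[h - 1])
        total += w_hn * inner
    return total


def linear_weight(alpha):
    """Illustrative nonincreasing weight function with w(1) = 0."""
    return 1.0 - alpha


residuals = [0.3, -1.2, 0.7, 2.5, -0.1]
lhs = direct_sum(residuals, linear_weight)
rhs = reordered_sum(residuals, linear_weight)   # agrees with lhs up to rounding
```

The check relies on the residuals having distinct absolute values, which mirrors the remark that ties can be ignored for continuous errors.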

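Now we define for an arbitrary $(p+1)$-dimensional d.f. $G(x, z)$ (where $x \in R^p$
and $z \in R$) the following statistical functional

$$\mathrm{MF}_\alpha(\beta_0 + t, G) := \int (z - x^T t)^2\, I\{|z - x^T t| \le G_t^{-1}(\alpha)\}\, dG(x, z) \qquad (2.3)$$

where $G_t$ is the d.f. of the random variable $|Z_G - X_G^T t|$ and the random vector $(X_G, Z_G)$
has d.f. $G$. If we choose $G$ equal to the e.d.f. $F_n$ we will get $G_t^{-1}(\alpha) = F_{n,t}^{-1}(\alpha) =
r_{|h|}(\beta_0 + t)$ for $\alpha \in \left(\frac{h-1}{n}, \frac{h}{n}\right]$, where $F_{n,t}$ denotes the e.d.f. based on the absolute
values of the residuals (i.e., based on $|Z_1 - X_1^T t|, \ldots, |Z_n - X_n^T t|$). Hence we obtain

$$\mathrm{MF}_\alpha(\beta_0 + t, F_n) = \frac{1}{n} \sum_{i=1}^{n} (Z_i - X_i^T t)^2\, I\{|Z_i - X_i^T t| \le r_{|h|}(\beta_0 + t)\} \qquad (2.4)$$

for $\alpha \in \left(\frac{h-1}{n}, \frac{h}{n}\right]$, which is exactly the term in the brackets in (2.2). Now we can
define

$$\mathrm{MF}(\beta, G) := \int_0^1 \mathrm{MF}_\alpha(\beta, G)\, dw^*(\alpha) \qquad (2.5)$$

where $w^*(\alpha) := w(0) - w(\alpha)$. Substituting $G := F_n$ in (2.5) we obtain the equation
$\mathrm{MF}(\beta, F_n) = \frac{1}{n}\mathrm{MF}_n(\beta)$. This is because $\mathrm{MF}_\alpha(\beta, F_n) = \mathrm{MF}_{h/n}(\beta, F_n)$ for
$\alpha \in \left(\frac{h-1}{n}, \frac{h}{n}\right]$ and hence

$$\mathrm{MF}(\beta, F_n) = \sum_{h=1}^{n} w_{h,n}\, \mathrm{MF}_{h/n}(\beta, F_n), \qquad (2.6)$$

which is equal to (2.2) (see also (2.4)).

Now we can define the statistical functional $T(G)$

$$T(G) := \arg\min_{\beta \in R^p} \mathrm{MF}(\beta, G). \qquad (2.7)$$

Obviously $T(F_n) = \hat\beta_n$, i.e., $T$ represents the LWS estimator.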

3. The Weak Consistency of the LWS Regression Estimator 

The following assumptions will be needed throughout the paper. 

A1: The weight function $w$ is nonincreasing, bounded and its first derivative exists
almost everywhere. Moreover, there exists $0 < a < 1$ such that $w(\alpha) > 0$ for
$\alpha \in (0, a)$ and $w(\alpha) = 0$ for $\alpha \in (a, 1)$.

A2: The random errors $Z_i$ have a continuous distribution with $EZ_i^2 < \infty$,
distribution function $F_Z$ and density $f_Z$. The density is bounded, symmetric, strictly
decreasing on $(0, \infty)$, $f_Z(x) > 0$ for $x \in R$ and exists everywhere.

A3: The random vectors of explanatory variables $X_i$ have finite second moments
and there exist $\delta_1 > 0$ and $\delta_2 > 0$ such that

$$P(|X_i^T t| < \delta_1) < a - \delta_2 \quad \text{for } t \in H_1, \qquad (3.1)$$

where $H_1 := \{t \in R^p : \|t\| = 1\}$.

Theorem 3.1 (Weak consistency of the LWS regression estimator). Let conditions
A1, A2 and A3 be satisfied. Then the LWS estimator is a weakly consistent estimator
of $\beta_0$, i.e., $\hat\beta_n \to_p \beta_0$.

The proof of Theorem 3.1 is in Section 4. Let us now look more closely at the
conditions of Theorem 3.1.

Condition A1 implies that the breakdown point of our estimator is positive
and equal to $\min\{1 - a, a\} > 0$.

In condition A2 we suppose a symmetric, single-peaked density of the random
errors. This condition is very important for consistency. Consider one-dimensional
data with a symmetric density of observations which has two sharp peaks,
one around $-1$ and one around $1$, i.e., approximately one half of the data is around
$-1$ and one half around $1$. If we apply the LWS estimator with $w(\alpha) = I\{\alpha < 0.5\}$
we get a value close to $-1$ or $1$. That is because LWS wants to fit 50% of the data.
But we want to obtain a value close to zero, which is the expected value of the
observations.
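
The two-peak example can be reproduced with a short sketch. For a location parameter and the indicator weight $w(\alpha) = I\{\alpha < 0.5\}$ the estimator reduces to least trimmed squares with $h = n/2$ retained observations, and the sketch uses the known property (Rousseeuw and Leroy, 1987) that the optimal subset consists of $h$ consecutive order statistics. The cluster spread of 0.1, the sample size and the function name `lts_location` are illustrative assumptions.

```python
import random


def lts_location(data, h):
    """Location LTS: mean of the h consecutive order statistics with the smallest sum of squares."""
    xs = sorted(data)
    best = None
    for start in range(len(xs) - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        ss = sum((v - mean) ** 2 for v in window)
        if best is None or ss < best[0]:
            best = (ss, mean)
    return best[1]


rng = random.Random(0)
# Symmetric two-peak sample: half of the data near -1, half near +1.
data = ([rng.gauss(-1.0, 0.1) for _ in range(100)]
        + [rng.gauss(1.0, 0.1) for _ in range(100)])
estimate = lts_location(data, h=100)   # lands near one of the peaks
sample_mean = sum(data) / len(data)    # close to the centre of symmetry, 0
```

The trimmed fit locks onto one cluster, whereas the sample mean stays near zero, which is why a single-peaked error density is assumed in A2.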

It can be shown that A3 is equivalent to the following condition: the random
variables $X_i$ have finite second moments and there exists $\varepsilon > 0$ such that

$$P(|X_i^T t| = 0) < a - \varepsilon \quad \text{for } t \in H_1. \qquad (3.2)$$

The last inequality, together with the existence of second moments, implies that the
$p \times p$ matrix $EX_iX_i^T$ is positive definite ($EX_iX_i^T$ being a positive definite matrix
is necessary for the consistency of classical least squares). But condition A3
is stronger and is important for the consistency of the LWS estimator. Because
$w(\alpha) = 0$ for $\alpha > a$ we try to fit only $\lfloor na \rfloor$ observations ($\lfloor x \rfloor$ denotes the integral
part of $x \in R$). Suppose there exists $t \in H_1$ such that $P(|X_i^T t| = 0) > a$. Then
for a large dataset (i.e., large $n$) there exists (with high probability) a subgroup of
observations which contains at least $\lfloor na \rfloor$ observations and whose matrix of
explanatory variables is singular. In that situation consistency is difficult to prove.

See also Davies (1990) for the definition of $\lambda_n(\alpha)$ (page 1657). The function
$\lambda_n(\alpha)$ measures the worst possible conditioning of any $\lfloor n\alpha \rfloor$ subset of the
explanatory variables for the linear regression model with fixed explanatory variables. We
can define an analogous measure in the case of random explanatory variables as
follows. Let $\delta(\alpha)$ be the largest possible $\delta$ which satisfies $P(|X_i^T t| < \delta) < \alpha$ for any
$t \in H_1$ ($\delta(\alpha)$ is equal to the largest possible $\delta_1$ in (3.1) for $\alpha := a - \delta_2$). Condition
A3 implies $\delta(\alpha) > 0$ for some $\alpha < a$.







4. Proof of the Weak Consistency of the LWS Regression Estimator 

To prove consistency we approximate the function $\mathrm{MF}(\beta, F_n)$ by $\mathrm{MF}(\beta, F)$. If we
knew that $\mathrm{MF}(\beta, F)$ attains a global minimum for $\beta = \beta_0$ and is increasing in the
neighborhood of $\beta_0$, then we would know that $\mathrm{MF}(\beta, F_n)$ has (for large $n$, with
high probability) a global minimum close to $\beta_0$. The following lemma proves the fact
that $\mathrm{MF}(\beta, F)$ has its minimum at $\beta = \beta_0$.

Lemma 4.1. Under conditions A1, A2 and A3, for any $\eta > 0$ there exists $\delta > 0$
such that

$$\mathrm{MF}(\beta, F) > \mathrm{MF}(\beta_0, F) + \delta \qquad (4.1)$$

for any $\beta \in R^p$, $\|\beta - \beta_0\| > \eta$.

Proof of Lemma 4.1. Because the LWS estimator is regression equivariant, we may
put $\beta_0 = 0$. It is sufficient to prove that $\mathrm{MF}(\beta, F)$ is a continuous function of $\beta$ and
that $\mathrm{MF}_\alpha(\beta, F)$ is strictly increasing from 0 in any direction, i.e.,

$$\mathrm{MF}_\alpha(0, F) < \mathrm{MF}_\alpha(\beta, F) \quad \text{for } 0 < \alpha < 1,\ \beta \ne 0, \qquad (4.2)$$

$$\mathrm{MF}_\alpha(\beta, F) < \mathrm{MF}_\alpha(K\beta, F) \quad \text{for } 0 < \alpha < 1,\ \beta \ne 0,\ K > 1. \qquad (4.3)$$

Both inequalities together with (2.5) imply the same property for $\mathrm{MF}(\beta, F)$,
because $\mathrm{MF}(\beta, F)$ is an average of $\mathrm{MF}_\alpha(\beta, F)$ (see (2.5)). Hence $\mathrm{MF}(\beta, F)$ is strictly
increasing from 0 in any direction. Because $\mathrm{MF}(\beta, F)$ is also continuous, the
inequality (4.1) holds. Hence the proof is completed by showing continuity of $\mathrm{MF}(\beta, F)$
and inequalities (4.2) and (4.3).

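To prove the continuity of $\mathrm{MF}(\beta, F)$ we substitute (2.3) into (2.5) and rewrite
$\mathrm{MF}(\beta, F)$ into the form

$$\mathrm{MF}(\beta_0 + t, F) = \int w\big(F_t(|z - x^T t|)\big)\, (z - x^T t)^2\, dF(x, z). \qquad (4.4)$$

Hence we can easily verify $\lim_{\beta' \to \beta} \mathrm{MF}(\beta', F) = \mathrm{MF}(\beta, F)$ and we can also express
$\frac{\partial}{\partial\beta}\mathrm{MF}(\beta, F)$ (the formula for $\frac{\partial}{\partial\beta}\mathrm{MF}(\beta, F)$ in the case of a location parameter is in
Masicek, 2002).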

We prove inequalities (4.2) and (4.3) using stochastic ordering (for the
definition see the appendix, Lemmas 5.1 and 5.2). Let $X$ and $Z$ be two independent
random variables with joint d.f. $F(x, z) = F_X(x)F_Z(z)$ ($x \in R^p$ and $z \in R$),
where $F_X(x)$ and $F_Z(z)$ are the marginal d.f.s. Let us express $\mathrm{MF}_\alpha(\beta, F)$ for fixed
$\alpha \in (0, 1)$ and $\beta \in R^p$. By definition (2.3) the value $\mathrm{MF}_\alpha(\beta, F)$ depends only on the
distribution of the random variable $|Z - X^T\beta|$. Because $X$ and $Z$ are independent
and $Z$ is symmetric it is easily seen that $(X, Z) =_d (X, Z\,\mathrm{sign}(X^T\beta))$ and hence
($=_d$ denotes equality in distribution)

$$|Z - X^T\beta| = \big|Z\,\mathrm{sign}(X^T\beta) - |X^T\beta|\big| =_d \big|Z - |X^T\beta|\big|. \qquad (4.5)$$

Let us introduce random variables

$$V_\beta := |X^T\beta|, \qquad U_\beta := |Z - V_\beta|. \qquad (4.6)$$
Obviously $\mathrm{MF}_\alpha(\beta, F)$ depends only on the distribution of $U_\beta$. It is easy to check
(see (2.3)) that

$$\mathrm{MF}_\alpha(\beta, F) = E(U_\beta^*)^2 \qquad (4.7)$$

where $U_\beta^* := U_\beta \cdot I\{U_\beta \le u_\beta(\alpha)\}$ and $u_\beta(\alpha)$ is defined by the equation

$$P(U_\beta \le u_\beta(\alpha)) = \alpha. \qquad (4.8)$$

Notice $F_\beta$ is the d.f. of the random variable $U_\beta$ and $u_\beta(\alpha) = F_\beta^{-1}(\alpha)$. Let us recall
that $F_\beta$ is the d.f. of $|Z - X^T\beta|$ (see (2.3), the definition of $G_t$).

Let us denote by $F_{\beta*}$ the d.f. of $U_\beta^*$. The distribution function $F_{\beta*}$ can be easily
expressed using $F_\beta$:

$$F_{\beta*}(u) = \min\{F_\beta(u) + (1 - \alpha), 1\} \quad \text{for } u > 0. \qquad (4.9)$$

This holds because $U_\beta^*$ is equal to $U_\beta$ cut off at its $\alpha$-quantile. For $u \le 0$ obviously
$F_\beta(u) = F_{\beta*}(u) = 0$.

Now we prove inequalities (4.2) and (4.3). Let $\beta_1, \beta_2 \in R^p$ be fixed such that

$$V_{\beta_1} \prec V_{\beta_2}, \qquad (4.10)$$

i.e., $V_{\beta_2}$ is stochastically larger than $V_{\beta_1}$ and both random variables do not have
the same distribution. If we prove

$$\mathrm{MF}_\alpha(\beta_1, F) < \mathrm{MF}_\alpha(\beta_2, F) \qquad (4.11)$$

the inequalities (4.2) and (4.3) will follow. The first inequality (4.2) is a special
case of (4.11) for $\beta_1 := 0$ and $\beta_2 := \beta \ne 0$. It follows immediately that $V_{\beta_1} \preceq V_{\beta_2}$
(because $V_{\beta_1} = V_0 = 0$ and $V_{\beta_2} = V_\beta \ge 0$) and $V_{\beta_1} \ne_d V_{\beta_2}$ (suppose $V_\beta =_d V_0 = 0$,
hence $X^T\beta = 0$ a.s., which is in contradiction to assumption A3). The second
inequality (4.3) is obtained for $\beta_1 := \beta$ and $\beta_2 := K\beta$, where $\beta \ne 0$ and $K > 1$.

It remains to prove (4.11). The d.f. of $U_\beta$ for $u > 0$ and $\beta \in R^p$ is

$$\begin{aligned}
F_\beta(u) &= P(U_\beta < u) = P(-u < Z - V_\beta < u) \\
&= E\,P(-u < Z - V_\beta < u \mid V_\beta) = E\,[F_Z(V_\beta + u) - F_Z(V_\beta - u)]. \qquad (4.12)
\end{aligned}$$

The last equality holds by independence of $Z$ and $V_\beta$.

Fix $u > 0$. Observe that the function $g(x) := F_Z(x + u) - F_Z(x - u)$ is strictly
decreasing in $x$ for $x > 0$ because

$$\frac{d}{dx}\,[F_Z(x + u) - F_Z(x - u)] = f_Z(x + u) - f_Z(x - u) < 0 \qquad (4.13)$$

for $x > 0$. The last inequality follows by assumption A2 ($|x + u| > |x - u|$ and $f_Z$
being symmetric and strictly decreasing on $(0, \infty)$). By Lemma 5.2 (implication
1. $\Rightarrow$ 4.), (4.10) and (4.12),

$$F_{\beta_1}(u) = E\,g(V_{\beta_1}) > E\,g(V_{\beta_2}) = F_{\beta_2}(u) \quad \text{for } u > 0. \qquad (4.14)$$

Combining (4.9) and (4.14) implies that $F_{\beta_1 *}(u) \ge F_{\beta_2 *}(u)$ for $u > 0$, i.e.,
$U_{\beta_1}^* \preceq U_{\beta_2}^*$ (see Lemma 5.1, implication 3. $\Rightarrow$ 1.). Using $F_\beta(0^+) < \alpha$ (see condition
A3 and the definition of $F_\beta$) we can assert that for $u$ from a right neighborhood of
zero $F_{\beta_1 *}(u) > F_{\beta_2 *}(u)$. Applying Lemma 5.2, implication 2. $\Rightarrow$ 3. (substitute
$X := U_{\beta_1}^*$, $Y := U_{\beta_2}^*$ and $h(x) := \mathrm{sign}(x)x^2$) yields

$$\mathrm{MF}_\alpha(\beta_1, F) = E(U_{\beta_1}^*)^2 = E\,h(U_{\beta_1}^*) < E\,h(U_{\beta_2}^*) = E(U_{\beta_2}^*)^2 = \mathrm{MF}_\alpha(\beta_2, F), \qquad (4.15)$$

because $U_{\beta_1}^* \ge 0$ and $U_{\beta_2}^* \ge 0$, i.e., (4.11) holds. □

Proof of Theorem 3.1. To prove the weak consistency we will use Prochorov's metric,
defined for any two $d$-dimensional distribution functions $G$ and $H$ as follows:

$$\pi(G, H) = \inf\{\delta : P_G(V \in A) \le P_H(V \in A^\delta) + \delta,\ A \text{ a Borel subset of } R^d\} \qquad (4.16)$$

where $A^\delta$ is the set of points in $R^d$ which have distance from $A$ smaller than $\delta$
and $P_G(V \in A)$ denotes the probability of $[V \in A]$ if $V$ has distribution function
$G$ (for details see Jureckova and Sen, 1996, Section 2.5.3).

Let us fix a probability space $(\Omega, \mathcal{A}, P)$ and denote by $F_n$ the e.d.f. based on
random vectors $(X_1, Z_1), \ldots, (X_n, Z_n)$. Observe that $F_n \to F$ pointwise and almost
surely (by the SLLN), and hence weakly. Because Prochorov's metric metrizes
weak convergence we obtain $\pi(F_n, F) \to 0$ a.s. and hence

$$\pi(F_n, F) \to_p 0. \qquad (4.17)$$
The LWS estimator is regression equivariant, hence we may put $\beta_0 = 0$.
Suppose the constant $\delta \in (0, 1)$ is small but fixed. In the next part of the proof we
will need several upper bounds on $\delta$. These bounds will depend only on the weight
function $w$ and the d.f. $F$ (it would be complicated to express $\delta$ now, therefore we
express it at the end of the proof).

Denote $K_Z := EZ_i^2 + 1$. Convergence (4.17) together with the weak law
of large numbers gives us the following: there exists a sequence $B_n \subset \Omega$ such that
$P(B_n) \to 1$ and for any $\omega \in B_n$ and any Borel $A \subset R^{p+1}$ the following inequalities
hold:

$$P((X, Z) \in A) \le P_n((X, Z) \in A^\delta) + \delta \qquad (4.18)$$

$$P_n((X, Z) \in A) \le P((X, Z) \in A^\delta) + \delta \qquad (4.19)$$

$$\frac{1}{n} \sum_{i=1}^{n} Z_i^2 < K_Z \qquad (4.20)$$

where $P$ denotes the probability measure based on d.f. $F$ (i.e., $(X, Z)$ has d.f. $F$)
and $P_n$ denotes the empirical probability measure based on $F_n$ (i.e., $(X, Z)$ has
d.f. $F_n$).

We split the rest of the proof into two steps. 

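1. We find a constant $K_1$ (which does not depend on $\delta$) such that for any $\omega \in B_n$

$$\mathrm{MF}(\beta, F_n) > \mathrm{MF}(0, F_n) \quad \text{for } \|\beta\| > K_1.$$

2. We find a constant $K_2$ (which does not depend on $\delta$) such that for any $\omega \in B_n$

$$|\mathrm{MF}(\beta, F_n) - \mathrm{MF}(\beta, F)| \le \delta K_2 + \frac{1}{n} K_2 \quad \text{for } \|\beta\| \le K_1.$$
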
These two results together with Lemma 4.1 conclude the proof. Fix $\eta > 0$.
By Lemma 4.1 there exists $\delta_1 > 0$ such that for any $\beta$ with $\|\beta\| > \eta$ it holds that

$$\mathrm{MF}(\beta, F) > \mathrm{MF}(0, F) + \delta_1. \qquad (4.21)$$

The second step together with (4.21) implies

$$\begin{aligned}
\mathrm{MF}(\beta, F_n) &\ge \mathrm{MF}(\beta, F) - \delta K_2 - \frac{1}{n} K_2 > \mathrm{MF}(0, F) + \delta_1 - \delta K_2 - \frac{1}{n} K_2 \\
&\ge \mathrm{MF}(0, F_n) + \delta_1 - 2\delta K_2 - \frac{2}{n} K_2 \qquad (4.22)
\end{aligned}$$

for any $\beta$ with $\eta < \|\beta\| \le K_1$. The first and the third inequalities come from step 2
and the second one from (4.21). Choosing $\delta < \delta_1/(6K_2)$ and $n_1 := 6K_2/\delta_1$ (notice
$\delta_1$ and $K_2$ do not depend on $\delta$) we obtain

$$\mathrm{MF}(\beta, F_n) > \mathrm{MF}(0, F_n) \quad \text{for } \eta < \|\beta\| \le K_1,\ n > n_1 \text{ and } \omega \in B_n. \qquad (4.23)$$

The last inequality, together with step 1, implies

$$\mathrm{MF}(\beta, F_n) > \mathrm{MF}(0, F_n) \quad \text{for } \|\beta\| > \eta,\ n > n_1 \text{ and } \omega \in B_n. \qquad (4.24)$$

Hence for $n > n_1$ and for any $\omega \in B_n$, $\|\hat\beta_n\| \le \eta$ because $\hat\beta_n$ is defined by
minimization of $\mathrm{MF}(\beta, F_n)$ (see (1.4)). Together with the convergence $P(B_n) \to 1$
we obtain weak consistency of $\hat\beta_n$ because $\eta$ was an arbitrary positive number.
Now we should prove steps 1 and 2.

Fix $\omega \in B_n$ for the rest of the proof. Then there are two distributions: the
empirical (i.e., $P_n$ where $(X, Z)$ has d.f. $F_n$) and the theoretical (i.e., $P$ where
$(X, Z)$ has d.f. $F$).

Let us prove step 1, i.e., we find a constant $K_1$ (which does not depend on $\delta$
or $\omega$) such that for any $\beta$ with $\|\beta\| > K_1$ and $\omega \in B_n$

$$\mathrm{MF}(\beta, F_n) > \mathrm{MF}(0, F_n). \qquad (4.25)$$

To prove this we show that $\mathrm{MF}(0, F_n)$ is bounded on $B_n$ and that $\mathrm{MF}(\beta, F_n)$ is
increasing with $\|\beta\|$.

The upper bound for $\mathrm{MF}(0, F_n)$ comes from (4.20) and from condition A1
($w$ is nonincreasing):

$$\mathrm{MF}(0, F_n) = \frac{1}{n} \sum_{h=1}^{n} w\!\left(\frac{h-1}{n}\right) r^2_{(h)}(0) \le w(0) \cdot \frac{1}{n} \sum_{i=1}^{n} Z_i^2 < w(0) K_Z \qquad (4.26)$$

for any $\omega \in B_n$.

Now we find a lower bound for $\mathrm{MF}(\beta, F_n)$. By condition A3 there exist $\delta_2 > 0$
and $\delta_3 > 0$ such that for $\beta \ne 0$

$$P(|X^T\beta| < 2\delta_3\|\beta\|) < a - 5\delta_2. \qquad (4.27)$$

We can choose $\delta_3$ in such a way that $\delta_3 < \delta_2$. Supposing $\delta < \delta_3$ (notice $\delta_3$ does
not depend on $\delta$) and applying (4.19) for $A := \{(x, z) \in R^{p+1} : |x^T\beta| < \delta_3\|\beta\|\}$
we obtain

$$\begin{aligned}
P_n(|X^T\beta| < \delta_3\|\beta\|) &= P_n((X, Z) \in A) \le P((X, Z) \in A^\delta) + \delta \\
&= P(|X^T\beta| < (\delta_3 + \delta)\|\beta\|) + \delta < P(|X^T\beta| < 2\delta_3\|\beta\|) + \delta_3 \qquad (4.28)
\end{aligned}$$

because $A^\delta = \{(x, z) \in R^{p+1} : |x^T\beta| < (\delta_3 + \delta)\|\beta\|\}$. Combining (4.27), (4.28) and
$\delta_3 < \delta_2$ yields

$$P_n(|X^T\beta| < \delta_3\|\beta\|) < a - 5\delta_2 + \delta_3 < a - 4\delta_2, \qquad (4.29)$$

i.e., for any $\beta \in R^p$ and $\omega \in B_n$ there are no more than $\lfloor n(a - 4\delta_2) \rfloor$ observations
which have $|X_i^T\beta| < \delta_3\|\beta\|$ ($\lfloor x \rfloor$ denotes the integral part of $x \in R$).

Let us take $K_3$ sufficiently large such that

$$P(|Z| > K_3 - 1) < \delta_2. \qquad (4.30)$$

Notice that $K_3$ depends only on the d.f. $F$. Inequality (4.19) for $A := \{(x, z) : |z| > K_3\}$,
for which $A^\delta = \{(x, z) : |z| > K_3 - \delta\}$, together with (4.30), implies that for any
$\omega \in B_n$ (recall $\delta < \delta_2$ and $\delta < 1$)

$$P_n(|Z| > K_3) \le P(|Z| > K_3 - \delta) + \delta < \delta_2 + \delta < 2\delta_2, \qquad (4.31)$$

i.e., no more than $\lfloor 2n\delta_2 \rfloor$ random errors have $|Z_i| > K_3$.

Notice that for any $\beta$ there are no more than $\lfloor n(1 - (a - \delta_2)) \rfloor$ observations
which do not have weights greater than or equal to $w(a - \delta_2) > 0$, because $w$ is
nonincreasing and hence at least $\lfloor n(a - \delta_2) \rfloor$ weights are greater than or equal to
$w(a - \delta_2)$. Together we get that for any $\beta$ and $\omega \in B_n$ at least $n^*$ observations
satisfy $|X_i^T\beta| \ge \delta_3\|\beta\|$ and $|Z_i| \le K_3$, and have weight greater than or equal to
$w(a - \delta_2)$. The number of these observations can be expressed as

$$n^* := n - \lfloor n(a - 4\delta_2) \rfloor - \lfloor 2n\delta_2 \rfloor - \lfloor n(1 - (a - \delta_2)) \rfloor \ge n\delta_2. \qquad (4.32)$$

Suppose $\delta_3\|\beta\| > K_3$. Hence at least $n^*$ terms in (2.1) are greater than or equal
to $w(a - \delta_2)(\delta_3\|\beta\| - K_3)^2$. This implies

$$\mathrm{MF}(\beta, F_n) = \frac{1}{n}\mathrm{MF}_n(\beta) \ge \frac{1}{n}\, n^*\, w(a - \delta_2)(\delta_3\|\beta\| - K_3)^2 \ge \delta_2\, w(a - \delta_2)(\delta_3\|\beta\| - K_3)^2. \qquad (4.33)$$

Combining (4.26) and (4.33) implies that there exists a sufficiently large
constant $K_1$ such that for any $\beta$ with $\|\beta\| > K_1$ and $\omega \in B_n$

$$\mathrm{MF}(\beta, F_n) \ge \delta_2\, w(a - \delta_2)(\delta_3\|\beta\| - K_3)^2 > w(0) K_Z > \mathrm{MF}(0, F_n) \qquad (4.34)$$

which completes step 1. Notice $K_1$ does not depend on $\delta$ or $\omega$.

It remains to prove step 2, i.e., to prove that on $B_n$ the value $\mathrm{MF}(\beta, F_n)$ is
close to $\mathrm{MF}(\beta, F)$ for any $\beta$ with $\|\beta\| \le K_1$. Because $\mathrm{MF}$ is an average of $\mathrm{MF}_\alpha$
(see (2.3) and (2.5)) it is sufficient to prove that $\mathrm{MF}_\alpha(\beta, F_n)$ is close to $\mathrm{MF}_\alpha(\beta, F)$
for $\alpha \le a$.

Fix $\beta$ with $\|\beta\| \le K_1$. We start by looking for an upper bound for the
squared residuals on $B_n$. Let us take $K_4$ and $K_5$ sufficiently large such that

$$P(|Z| > K_4 - 1) < (1 - a)/4, \qquad P(\|X\| > K_5 - 1) < (1 - a)/4. \qquad (4.35)$$
Notice that $K_4$ and $K_5$ depend only on the d.f. $F$. Suppose $\delta < (1 - a)/4$. Inequalities
(4.35) imply for any $\omega \in B_n$ (see (4.19) and recall $\delta < 1$)

$$\begin{aligned}
P_n(|Z| \le K_4) &= [1 - P_n(|Z| > K_4)] \ge [1 - P(|Z| > K_4 - \delta) - \delta] \\
&> 1 - (1 - a)/4 - \delta > (1 + a)/2, \qquad (4.36)
\end{aligned}$$

$$\begin{aligned}
P_n(\|X\| \le K_5) &= [1 - P_n(\|X\| > K_5)] \ge [1 - P(\|X\| > K_5 - \delta) - \delta] \\
&> 1 - (1 - a)/4 - \delta > (1 + a)/2. \qquad (4.37)
\end{aligned}$$

Inequalities (4.35) also imply

$$P(|Z| \le K_4) > (1 + a)/2, \qquad P(\|X\| \le K_5) > (1 + a)/2. \qquad (4.38)$$

Denote $K_6 := K_4 + K_5 K_1$. The inequality $\|\beta\| \le K_1$, together with (4.36), (4.37)
and (4.38), implies (for any $\omega \in B_n$)

$$P_n(|Z - X^T\beta| \le K_6) > a, \qquad P(|Z - X^T\beta| \le K_6) > a. \qquad (4.39)$$

The first inequality in (4.39) gives us that for any $\beta$ with $\|\beta\| \le K_1$ and for
any $\omega \in B_n$ at least $\lfloor na \rfloor$ observations have residuals $(Z_i - X_i^T\beta)$ bounded by the
constant $K_6$. The second one provides an upper bound on the theoretical quantile.

Let us recall that $F_\beta$ is the d.f. of the random variable $|Z_i - X_i^T\beta|$ and $F_{n,\beta}$ is
the corresponding e.d.f. Inequalities (4.39) imply $F_{n,\beta}(K_6) > a$ and $F_\beta(K_6) > a$.
Hence we can write

$$r_{|h|}(\beta) = F_{n,\beta}^{-1}(h/n) \le K_6, \qquad F_\beta^{-1}(\alpha) \le K_6 \qquad (4.40)$$

for $h \le \lfloor na \rfloor$, $\alpha \le a$ and $\beta$ with $\|\beta\| \le K_1$. Notice that we are still working on
$B_n$, i.e., the first inequality in (4.40) holds for any $\omega \in B_n$.

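Now we prove that $F_{n,\beta}$ is close to $F_\beta$. Applying (4.19) yields

$$\begin{aligned}
F_{n,\beta}(u) &= P_n(|Z - X^T\beta| < u) \le P(|Z - X^T\beta| < u + \delta(1 + \|\beta\|)) + \delta \\
&= F_\beta(u) + P(u \le |Z - X^T\beta| < u + \delta(1 + \|\beta\|)) + \delta \qquad (4.41)
\end{aligned}$$

for $u > 0$. Independence of $Z$ and $X$ and the inequality $\|\beta\| \le K_1$ imply

$$\begin{aligned}
P(u \le |Z - X^T\beta| < u + \delta(1 + \|\beta\|)) &= E\,P[u \le |Z - X^T\beta| < u + \delta(1 + \|\beta\|) \mid X] \\
&= E \int I\{u \le |z - X^T\beta| < u + \delta(1 + \|\beta\|)\}\, f_Z(z)\, dz \\
&\le 2\delta(1 + \|\beta\|) M_f \le 2\delta(1 + K_1) M_f \qquad (4.42)
\end{aligned}$$

where $M_f := \sup\{f_Z(z) : z \in R\} < \infty$. Combining (4.41) and (4.42) yields

$$F_{n,\beta}(u) \le F_\beta(u) + \delta(1 + 2(1 + K_1)M_f) = F_\beta(u) + \delta K_7 \qquad (4.43)$$

for any $\beta$ with $\|\beta\| \le K_1$, where $K_7 := (1 + 2(1 + K_1)M_f)$. Notice $K_7$ does not
depend on $\delta$ or $\omega$. In the same way we express the lower bound for $F_{n,\beta}(u)$ (using
(4.18)) and obtain ($\|\cdot\|_\infty$ denotes the supremum norm)

$$\|F_{n,\beta} - F_\beta\|_\infty \le \delta K_7. \qquad (4.44)$$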
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Fix q; < a. In the rest of the proof we use the same method as in Lemma 4.1. 
We define a random variable which has d.f. Fp and random variable Up which 
has d.f. Fp^ where 

Fp^^{u) = min {Fp{u) + (1 - a), 1} for > 0 (4.45) 

and Fp^(u) = 0 for < 0. Notice that Fp^{u) = 1{oi:u>Kq because Fp{K^) > a 
(see (4.39) or (4.40)). Hence we can write 



MF«(/3,F) = E([/^) - / / {l-F0,{V^))dz. (4.46) 

Jo Jo 

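In the same way we express $MF_\alpha(\beta, F_n)$. Denote by $F_{n,\beta}^*$ the e.d.f. based
on the values $|Z_i - X_i^T\beta| \cdot I\{|Z_i - X_i^T\beta| \le r_{[h]}(\beta)\}$ where $h := \lfloor n\alpha \rfloor$. Because $F_{n,\beta}$
is not continuous we should express $F_{n,\beta}^*$ more carefully. Noticing that $r_{[h]}(\beta)$ is the
$\alpha$-quantile of $F_{n,\beta}$ gives

$$F_{n,\beta}^*(u) = \min\{F_{n,\beta}(u) + \lfloor n(1 - \alpha) \rfloor / n,\, 1\} \quad \text{for } u \ge 0 \tag{4.47}$$

and $F_{n,\beta}^*(u) = 0$ for $u < 0$. Again $F_{n,\beta}^*(u) = 1$ for $u \ge K_6$ and $\omega \in B_n$ because
$F_{n,\beta}(K_6) \ge \alpha$ (see (4.39) or (4.40)). Hence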



MFa{fJ,Fn)= {l-FnMV^))dz 



Combining (4.44), (4.45) and (4.47) gives

$$\|F_{n,\beta}^* - F_\beta^*\|_\infty \le \delta K_7 + 1/n. \tag{4.49}$$



Last inequality, together with (4.46) and (4.48), implies 

|MF„(/3,F„)-MF„(AF)|= f \FnM^)-F,}*{z))dz (4.50) 

Jo 



< / \Fn,i3*{z) - F0*{z)\ dz < SKs + Kd/n 

Jo 

for $\|\beta\| \le K_1$ and $\omega \in B_n$ where $K_8 := K_6 K_7$ and $K_9 := K_6$, which concludes
the proof of step 2 because we can take $K_2 := \max\{w(0)K_8, w(0)K_9\}$ (see (2.5)).
Notice that $K_2$ depends only on the weight function $w$ and the d.f. $F$.

Finally notice that it is sufficient to take

$$\delta := \min\{1,\ \delta_1/(6K_2),\ \delta_3,\ (1 - \alpha)/4\} \in (0, 1) \tag{4.51}$$

where the constants $\delta_1$, $K_2$, $\delta_3$ and $\alpha$ depend only on the weight function $w$ and on
the distribution of random errors and explanatory variables (i.e., on the d.f. $F$).

□ 
5. Appendix

Let $X$ and $Y$ be two random variables. We say that $Y$ is stochastically larger than $X$
(denoted $X \preceq Y$) if and only if $Eh(X) \le Eh(Y)$ for any nondecreasing function $h$
for which both expected values exist and are finite.

Lemma 5.1. (stochastic ordering 1) Let $X$ and $Y$ be two random variables with
distribution functions $F_X$ and $F_Y$. Then the following conditions are equivalent:

1. $Eh(X) \le Eh(Y)$ for any nondecreasing function $h$ for which both expected
values exist and are finite (i.e., $X \preceq Y$),

2. $Eg(X) \ge Eg(Y)$ for any nonincreasing function $g$ for which both expected
values exist and are finite,

3. $F_X(z) \ge F_Y(z)$ for $z \in \mathbb{R}$.

Lemma 5.2. (stochastic ordering 2) Let $X$ and $Y$ be two random variables with
distribution functions $F_X$ and $F_Y$. Suppose that $X \preceq Y$. Then the following conditions
are equivalent:

1. $X \prec Y$,

2. there exists $z_0 \in \mathbb{R}$ such that $F_X(z_0) \ne F_Y(z_0)$,

3. $Eh(X) < Eh(Y)$ for any increasing function $h$ for which both expected values
exist and are finite,

4. $Eg(X) > Eg(Y)$ for any decreasing function $g$ for which both expected values
exist and are finite.

Algorithms for Robust Model Selection
in Linear Regression 

S. Morgenthaler, R.E. Welsch and A. Zenide 

Abstract. In modelling situations, we often have a choice. We can add to 
model complexity to describe some unusual observations or groups of obser- 
vations or we can remove those observations and fit the rest with a different, 
perhaps simpler, model. Or, we might put weights on observations and/or 
variables and use all of the data or all of the model complexity or some com- 
bination. To decide what is unusual generally requires some sort of model, 
but that very model may be what is causing some observations to be seen as 
unusual or, in fact, masking unusual observations. 

When confronted both by unnecessary model complexity and by unusual 
observations, a model selection technique must be able to identify the correct 
model structure as well as unusual observations. This paper proposes some 
new algorithms that are designed to address these problems and compares 
them with currently available approaches such as robust Cp and robust cross- 
validated selection. 

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62G09. 
Keywords. Variable selection, outliers, cross-validation, robust regression. 

1. Introduction 

Given observations $y = (y_1, \ldots, y_n)$ and corresponding instances of explanatory
variables $x_1 \in \mathbb{R}^p, \ldots, x_n \in \mathbb{R}^p$, we consider linear predictors

$$\hat y_i = \beta_0 + \beta' x_i \tag{1.1}$$

with regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$. We assume that the observations are
realizations of

$$Y_i = E(Y_i \mid x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1.2}$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent and have common mean 0 and variance $\sigma^2$. A
model selection technique selects a best subset among the $p$ explanatory variables
such that an accurate linear predictor of type (1.1) is obtained if non-zero regression
coefficients are only allowed for the selected variables. This is a fundamental
problem of statistics, which is of great importance beyond the linear regression
setting. Assuming a Gaussian distribution for the errors $\varepsilon_i$, many asymptotically
equivalent criteria have been proposed, among them (in chronological order):



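$$\begin{aligned}
\text{MS/df(subset)} &= \mathrm{RSS}_{\text{subset}} / (n - p_{\text{subset}})^2 && \text{(Tukey, 1967)} \\
\text{FPE(subset)} &= \mathrm{RSS}_{\text{subset}}\, (n + p_{\text{subset}}) / (n - p_{\text{subset}}) && \text{(Akaike, 1969)} \\
\text{PRESS(subset)} &= \text{leave-out-1 cross-validated } \mathrm{RSS}_{\text{subset}} && \text{(Allen, 1971)} \\
\text{AIC(subset)} &= n \log(\mathrm{RSS}_{\text{subset}}) + 2\, p_{\text{subset}} && \text{(Akaike, 1973)} \\
\text{AICc(subset)} &= n \log(\mathrm{RSS}_{\text{subset}}) + 2\,(p_{\text{subset}} + 1) \times \frac{n}{n - p_{\text{subset}} - 2} && \text{(Sugiura, 1978)}
\end{aligned}$$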



where $p_{\text{subset}}$ = (size of the subset + 1) counts the number of non-zero regression
coefficients in (1.1). These criteria only depend on the subset and in particular
do not require knowledge of $\sigma^2$. The basic ingredient for their computation is
the residual sum of squares $\mathrm{RSS}_{\text{subset}} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$. In the case of PRESS,
the leave-out-1 cross-validation estimate refers to $\sum_{i=1}^{n} (y_i - \hat y_{i,\mathrm{cv}})^2$, with $\hat y_{i,\mathrm{cv}}$
denoting the prediction computed without using the $i$th observation. When using
the least-squares estimator, the corresponding residuals can be directly computed
with the help of the leverages $h_{ii} = x_i'(X'X)^{-1}x_i$, which on average are equal to $p_{\text{subset}}/n$.
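
As an illustration of the leverage shortcut, the following minimal sketch computes PRESS for a least-squares fit with intercept without refitting the model $n$ times. It relies on the standard identity that the leave-out-1 prediction residual equals the ordinary residual divided by $1 - h_{ii}$. The function name `press_statistic` and the use of NumPy are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def press_statistic(X, y):
    """Leave-out-1 PRESS for least squares with intercept, using leverages.

    X is an (n, p) array of explanatory variables (without intercept column),
    y an (n,) response. Returns (PRESS, leverages).
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    Q, _ = np.linalg.qr(A)                    # thin QR, hat matrix H = Q Q'
    h = np.sum(Q ** 2, axis=1)                # leverages h_ii = diag(H)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta                      # ordinary residuals
    press = np.sum((resid / (1.0 - h)) ** 2)  # leave-out-1 residuals, squared
    return press, h
```

The leverages returned here sum to the number of fitted coefficients, so their average is $p_{\text{subset}}/n$ as stated above.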

The MS/df criterion was proposed by J.W. Tukey in reaction to a remark in 
Anscombe (1967) and the finite sample correction to AIC was popularized and 
generalized by Hurvich and Tsai (1989). As shown in Shibata (1981), these criteria 
are asymptotically efficient in the sense that the variance of the prediction error 
achieved by the chosen predictor converges to the best variance possible. The first 
three criteria are all approximately proportional to $\mathrm{RSS}_{\text{subset}}\,(1 + 2\,p_{\text{subset}}/n)$.
If we assume that the mean function $E(Y_i \mid x_i)$ satisfies

$$Y_i = \mu + \theta' x_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1.3}$$

for some unknown $\mu \in \mathbb{R}$ and $\theta \in \mathbb{R}^p$, then a different formulation of the model
selection problem asks for identification of the “true subset”, that is, the index set of
the non-zero components of $\theta$. Related to this are the notions of overfitting,
underfitting (parsimony) and asymptotic consistency. All the criteria mentioned above
are asymptotically inconsistent and tend to select too large a subset (they overfit).
PRESS can be rendered consistent by fitting on fewer than $n - 1$ observations and
thus keeping in reserve for validation more than 1 observation (leave-out-more-than-1
cross-validation, see Shao, 1993). In fact, the validation set should
be of size $O(n)$.
In practice, the errors are rarely generated by a Gaussian distribution and
this invalidates the theoretical arguments behind both the least squares estimates 
used in (1.1) and the selection criteria based on the residual sum of squares. Robust 
model selection refers to the problem of finding a good subset - either for prediction 
or in order to identify correctly the true model - in a situation where the observa- 
tions (both X and y) may contain outliers (gross errors) and with errors exhibiting 
non-Gaussian behavior. A possible solution to this problem is the construction of 
robust versions of the model selection criteria, applicable for heavy-tailed error
distributions and based on the replacement of $\beta_0$ and $\beta$ in (1.1) with robust estimates.
A nonexhaustive list of related papers contains Ronchetti (1985), who used the 
AIC statistic, Hurvich and Tsai (1990), who considered least-absolute-deviations 
regression and AICc, Ronchetti and Staudte (1994), who used the Cp statistic (see 
Mallows, 1973) and Ronchetti et al. (1997), who used cross-validation. A Bayesian 
approach is contained in Hoeting et al. (1996). 

It is evident that outliers can create insoluble problems due to the heteroge- 
neous apportioning of the information in regression problems. If we are asked to 
uncover unusual observations using only the available data, then we have to rely 
on the rule of the majority; that is, an observation is unusual because most of the 
others are quite different. Outliers in data generated by (1.2) or (1.3), however, 
are outliers because of the outlyingness of the errors and these we do not observe 
directly. Instead we have to rely on residuals $\hat y_i - y_i$ and these in turn depend on
our estimation of $\beta_0$ and $\beta$. Each residual thus does not only depend on its
associated error, but rather on the errors of all the data points used in the estimation.
If an observation is exceptionally informative for one of the regression parameters, 
it will necessarily have a small residual and its outlyingness is directly linked to 
the presence or absence of the corresponding parameter in the selected model. In 
a simple linear regression, for example, with $x_1 = \cdots = x_{n-1} = 0$ and $x_n = 1$, the
last observation can never be judged internally. If it is an outlier, we simply have 
had bad luck. Robust methods cannot resolve this predicament and can thus only 
be applied in situations where sufficient redundancy is assured. 

In some sense, uncovering outliers and selecting a subset of the variables are 
interchangeable problems, a fact that can be made explicit by adding a dummy 
variable for each observation. If we had a model selection method that could deal 
with the case p > n, then we could select not only among the original p explanatory 
variables, but also from the n dummy variables. Each dummy variable ending up 
in the final selection corresponds to an observation that is being treated specially. 
It is also important to realize that outlyingness can only be defined with respect 
to a model. 

In this paper, we will explore methods that play on this duality between vari- 
able selection and data selection. Sampling or selecting among the n observations 
is a widely used statistical technique that can be adapted to robust regression 
and regression diagnostics. It has been applied to inference problems such as bias
estimation, variance estimation and confidence intervals (cross-validation, bootstrap,
jackknife), to the detection or nomination of outliers (Hawkins et al., 1984, Billor
et al., 1989 and Fernholz et al., 2004), to the computation of high-breakdown
estimators (Rousseeuw, 1984 and Rousseeuw and Leroy, 1987) and for general
diagnostic purposes (Atkinson and Riani, 2000). We will use sampling for variable
selection as well.



2. Building Blocks 

Variable selection in a linear model is a challenge even in the best of circumstances 
when 

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon,$$

with the random errors $\varepsilon_1, \ldots, \varepsilon_n$ uncorrelated and following a normal distribution
and the design $x_1, \ldots, x_p$ carefully chosen. Many different methods for selection
among the variables $\{x_1, \ldots, x_p\}$ have been proposed and analyzed. Most are
based on the minimization of the estimated prediction error of the selected model. 
No best method, universally agreed upon, exists even in this simplistic setting. 

The presence of outliers among the observed errors impacts variable selection 
profoundly. If we simply apply methods designed for the Gaussian case, we may 
make a bad choice in the sense that a single outlier can lead us to exclude one 
or several important variables from the final model and substitute variables with 
little predictive power. This traditional robustness problem is intimately linked 
with the difficulties caused by the presence of outlying points in the design space. 
Such singular points can also have an overwhelming influence. They greatly enlarge 
the space spanned by the design and if we search for a model with good predictive 
power over the whole span then these points cannot simply be ignored. On the 
other hand, if a wild design point is afflicted with a very large error, then it 
may be impossible to estimate a predictor that works well over the whole span. 
This dilemma is very different from cases where replicated observations that allow 
a clear identification of the potential outliers are available. Sometimes the best 
one can do is to delete wild design points and to concentrate on the prediction 
within the reduced span. The developer of a procedure for variable selection in the 
presence of outliers must respond to these challenges. The choices made by the 
developer should be transparent and understandable to the user. 

Two methodologies are discussed in the literature. On the one hand, one may pretreat
the data with the aim of flagging potential outliers. The technique most often
cited for this purpose is LMS (least-median-squares) (Rousseeuw, 1984) applied to 
the model using all available variables. This method is easy to explain and finds 
outliers even when they are masked by other outlying observations. Note that all 
available explanatory variables are used in the initial outlier screening. On the 
other hand, one may simultaneously nominate outliers and select variables. This 
is more appealing because it respects the dualism of these tasks. If we combine 
the data into the matrix $[x_1 \mid \cdots \mid x_p \mid y]$, variable selection refers to the selection of
some among the first $p$ columns, whereas outlier nomination refers to the flagging
of some of the rows.

LMS is a robust regression fit most often implemented via a random search 
based on repeatedly drawing (p+ 1) rows at random, computing the corresponding 
fit and the value of the median squared residual over all observations. The fit 
with the minimal value of this criterion is finally chosen. Potential outliers are
observations with large absolute residual relative to the final fit. One can modify 
this procedure by updating an outlier score per observation based on the size of 
its residual in the randomly drawn fits. A host of outlier nomination procedures 
are discussed in the literature and could be used for this purpose. The impact 
of the pretreatment step on the second stage, the variable selection, is unclear. 
Sometimes the variable selection step is based uniquely on the “cleaned data”, 
that is the data where the putative outliers have been removed. In this case, one 
has to be careful not to remove highly informative observations. It is preferable
if such outliers are kept in play at the second stage, since the flagged observations
in any case probably also contain valuable information.
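
The random-search implementation of LMS described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `lms_random_search`, the number of draws, and the rule that flags observations whose absolute residual exceeds 2.5 robust scale units are all choices made for this example.

```python
import numpy as np

def lms_random_search(X, y, n_draws=500, seed=0):
    """Approximate least-median-of-squares fit by random elemental subsets.

    Repeatedly draws p + 1 rows, fits them exactly, and keeps the fit with
    the smallest median squared residual over all observations.
    Returns (coefficients including intercept, boolean outlier flags).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    best_crit, best_beta = np.inf, None
    for _ in range(n_draws):
        rows = rng.choice(n, size=p + 1, replace=False)
        try:
            beta = np.linalg.solve(A[rows], y[rows])  # exact fit to p + 1 rows
        except np.linalg.LinAlgError:
            continue                                  # singular draw, skip it
        crit = np.median((y - A @ beta) ** 2)         # LMS criterion
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    resid = y - A @ best_beta
    scale = 1.483 * np.median(np.abs(resid))          # robust residual scale
    flagged = np.abs(resid) > 2.5 * scale             # assumed cutoff of 2.5
    return best_beta, flagged
```

An outlier score per observation, as mentioned in the text, could be obtained by accumulating the absolute residuals over the draws instead of looking only at the final fit.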

The parallel search of a model and of potential outliers is naturally imple- 
mented by sampling. In the base procedure, we select a random sample of the 
observations and of the variables and fit the corresponding model to the sampled 
observations by least squares. We then compute a robust estimate of the predic- 
tion error based on the residuals of the observations that have not been sampled. 
Computing the best fit, the one minimizing the estimated prediction error, over 
repetitions of the sampling leads to a robust model choice as long as the sub- 
samples are not too large. If large subsamples are chosen, robust fitting should 
be considered. This basic procedure can be modified in many directions. We can 
search the model space in a more systematic fashion, for example in the manner 
of a forward/backward method in which models are compared to models with one 
additional or one less variable. Furthermore, this comparison can be sharpened 
by computing each of the models under comparison for all subsamples. We can 
also choose the subsamples of the observations in a balanced manner, ensuring, 
for example, that all observations are equally often represented or that all couples 
of observations are equally often represented, and so on. Another idea consists 
in replacing simple random sampling by weighted sampling with the weights re- 
adjusted after each step. The weights on the observations can take into account 
two aspects, they increase with decreasing outlyingness score and they increase 
with increasing information content. The weights on the variables should ideally 
increase with increasing predictive power of the variable and point us towards 
the good models. This can in general not be achieved with simple multinomial 
sampling and requires more sophisticated approaches. 

Robust variable selection can thus be formulated as an optimization problem 
where a robust estimate of the prediction error is to be minimized over all possible 
models. The impact of the choice of this estimate on the quality of the procedure, 
the behavior of the procedure in the case of unbalanced designs, etc. is unclear. 
Similarly, there are a variety of algorithms for solving such problems and in this 
area too, not enough is known about the relative merits of various procedures.
The results of extended simulation studies would, in our opinion, be a welcome 
addition to the existing literature. 



3. Penalty Methods



In this section we explore the idea that uncovering outliers and selecting a sub- 
set of variables are, in some sense, interchangeable. Our matrix of “explanatory” 
variables becomes 



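$$\begin{pmatrix}
x_{11} & \cdots & x_{1,p+1} & 1 & 0 & 0 & \cdots & 0 \\
x_{21} & \cdots & x_{2,p+1} & 0 & 1 & 0 & \cdots & 0 \\
x_{31} & \cdots & x_{3,p+1} & 0 & 0 & 1 & \cdots & 0 \\
\vdots &        & \vdots    & \vdots & \vdots & \vdots & & \vdots \\
x_{n1} & \cdots & x_{n,p+1} & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}$$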



where the last column has been omitted to remove the intercept singularity. Since 
the number of columns now exceeds the number of rows, many traditional regres- 
sion and selection procedures cannot be applied. However, penalty methods are 
useful in such situations and can provide some insight into our model and possible 
outliers. We now assume that the x-variables have been centered and scaled and
that the “dummy” columns have been centered. The new matrix of explanatory
variables, which we call $Z$, will be $n$ by $n + p - 1$ and our problem becomes

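$$\min_{\beta} \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{n+p-1} z_{ij}\beta_j \Bigr)^2 + k \sum_{j=1}^{n+p-1} \beta_j^2 \right\} \tag{3.1}$$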



with $\beta_0$ estimated by $\bar y$. The penalty parameter, $k$, will control the complexity of
the problem, i.e., the degrees of freedom used up in the estimation process. We
can never use more than $n$ degrees of freedom, even though we have $n + p - 1$
possibilities. For $k$ large enough, (3.1) is a well posed problem. The complexity or
effective number of parameters (Hastie et al., 2001) is given by

$$\operatorname{trace}\, Z(Z'Z + kI)^{-1}Z'. \tag{3.2}$$
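
To make the construction concrete, the sketch below builds an augmented design matrix with centered row dummies and evaluates the effective number of parameters (3.2) through the singular values of $Z$, using the identity $\operatorname{trace} Z(Z'Z + kI)^{-1}Z' = \sum_i s_i^2/(s_i^2 + k)$. The helper names `build_design_with_dummies` and `effective_df` are hypothetical, and the input `X` is assumed to hold $p$ explanatory variables without an intercept column.

```python
import numpy as np

def build_design_with_dummies(X):
    """Return Z: centered and scaled x-variables plus centered row dummies.

    The dummy for the last row is omitted, so Z has n + p - 1 columns
    when X has p columns.
    """
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # center and scale x-variables
    D = np.eye(n)[:, : n - 1]                  # one dummy per row, last omitted
    D = D - D.mean(axis=0)                     # center the dummy columns
    return np.hstack([Xs, D])

def effective_df(Z, k):
    """Effective number of parameters trace Z (Z'Z + kI)^{-1} Z' for k > 0."""
    s = np.linalg.svd(Z, compute_uv=False)     # singular values of Z
    return float(np.sum(s ** 2 / (s ** 2 + k)))
```

Because every column of such a $Z$ is centered, its rank is at most $n - 1$, which is consistent with the remark that no more than $n$ degrees of freedom can be used.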



There are many variations on (3.1). If we use the L1 norm for the penalty
term, we obtain the “Lasso” of Tibshirani (1996). This is potentially useful in
our case since we expect relatively few coefficients to be non-zero, but will not be
explored in this paper. We use the L2 norm for the estimation term in (3.1) because
the row dummies in the penalty term are designed to provide some robustness to
outlying observations. A more robust loss function (e.g., L1) could be used but at
increased computational cost.

Fan and Li (2001) consider a wide variety of penalty functions in the selection 
context. When extended to our selection problem with both real and dummy 
explanatory variables, the SCAD penalty functions of Fan and Li would allow large 
outliers not to be penalized (row fully removed without bias) and also allow rows 
with small outliers to have their dummy coefficients set to zero. We are planning
to see if their methods will improve upon the preliminary results presented here. 

Penalized methods require that we choose k. When the number of parameters 
is less than the number of observations there are many ways to do this, but a 
common choice is cross-validation. When the number of parameters exceeds the 
number of observations, there is almost no other choice. 

Cross-validated penalized methods seem to provide excellent results for prediction
(Frank and Friedman, 1996) where there is no need to actually “select”
variables. In order to compare our results to a known model with known outly- 
ing observations, we need to select the real explanatory variables as well as those 
dummy variables that try to fit the outlying observations. 

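We choose to simply look at the $t$-statistics for the estimated coefficients

$$t_j(k) = \frac{\hat\beta_j(k)}{\sqrt{\bigl[\hat\sigma^2\,(Z'Z + kI)^{-1} Z'Z\, (Z'Z + kI)^{-1}\bigr]_{jj}}}$$

with $\sigma$ estimated robustly by

$$\hat\sigma = 1.483\ \operatorname{med}_i \Bigl|\, y_i - z_i'\hat\beta(k) - \operatorname{med}_l \bigl( y_l - z_l'\hat\beta(k) \bigr) \Bigr| .$$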



When \ tj{k) | is greater than 2, we take the variable to be in the model. This can 
be further refined by the use of degrees of freedom, such as n minus the complexity 
(3.2), because the complexity is a measure of the degrees of freedom “used up” by 
the model. 
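
A minimal sketch of this selection rule is given below. It assumes the ridge estimate $\hat\beta(k) = (Z'Z + kI)^{-1}Z'(y - \bar y)$ and the usual sandwich form of its covariance, with the residual scale estimated by 1.483 times the median absolute deviation of the residuals. The function name `ridge_t_statistics` is an assumption for this example, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def ridge_t_statistics(Z, y, k):
    """t-statistics of ridge coefficients with a MAD-based residual scale.

    Z is an (n, m) matrix of centered explanatory columns, y the response,
    k > 0 the penalty parameter. Returns (t, beta, sigma).
    """
    yc = y - y.mean()                          # intercept estimated by mean of y
    m = Z.shape[1]
    G = Z.T @ Z
    M = np.linalg.inv(G + k * np.eye(m))       # (Z'Z + kI)^{-1}
    beta = M @ Z.T @ yc                        # ridge coefficients
    resid = yc - Z @ beta
    sigma = 1.483 * np.median(np.abs(resid - np.median(resid)))
    cov = sigma ** 2 * (M @ G @ M)             # sandwich covariance of beta
    t = beta / np.sqrt(np.diag(cov))
    return t, beta, sigma
```

Variables (real or dummy) with `abs(t) > 2` would then be retained, following the rule stated in the text.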

In practice, the penalty method works well for cases where gross errors and 
leverage are not serious problems. However, the penalty $\|\beta\|^2$ grows with the
number of outliers and interferes with the selection of the “real” variables. One 
possible solution is to have two penalty functions - one related to the row dummy 
variables and the other to the real explanatory variables. This requires double 
cross-validation and increased computational complexity. 

Our approach is to use the penalty method in five stages: 

1. The solution to the optimization problem (3.1) is found using robust (MAD)
cross-validation to determine $k$. We used ten-fold cross-validation which leads
to large training sets. Smaller training sets as suggested by Shao (1993) did
not dramatically change our results.

2. All rows with dummy variables having $|t| > 2$ are removed.

3. The model (3.1) is optimized again without these rows and a new value of $k$
is found by cross-validation which we will call $k^*$. This is considered to be a
“robust” value of $k$.

4. All rows are again used to optimize (3.1) with $k$ fixed at $k^*$. Rows with
dummy variables having $|t| > 4$ are now removed.

5. The optimization problem (3.1) is now solved for the remaining rows with $k$
found by cross-validation. Explanatory variable selection is done using $|t| > 2$
as the selection criterion.
The above procedure provides a fairly severe initial “pruning” of the data
where the pruning is done based on the simultaneous weighting (in the principal
component coordinate system) of both the real explanatory variables and row
dummies. Then $k^*$ is found as a working guess at the ratio of the likelihood variance
to the prior variance (in the ridge or Bayes regression sense). In effect, this keeps
$k$ from getting too small due to a large $\|\beta\|^2$ caused by a large number of outliers.
Final row removal using $|t| > 4$ is done to maintain efficiency for the Gaussian
case.



4. Sampling Methods 

Ronchetti et al. (1997) have addressed variable selection and robustness using 
bounded influence methods whose breakdown decays as p increases. To maintain 
high breakdown for larger values of p, sampling methods (rather than true opti- 
mization methods) appear to be essential. 

The traditional sampling approach is to argue that we can, with high proba- 
bility, find a “good” subset of the data where our robust objective function would 
be minimized. Of course, we may never get such a data set in our sample, and the 
true minimum would be missed and our estimates would be wrong. These sampling 
methods (e.g., Rousseeuw and Leroy, 1987) have, so far, not been used directly 
in connection with variable selection. They have been used as a way to “clean” 
or “pre-screen” the data on the full model (all variables present) before variable 
selection is applied (Ronchetti et al., 1997) and to determine which outliers to 
consider in combination with a Bayesian selection method (Hoeting et al., 1996). 

Since robust sampling methods like least median of squares are already com- 
putationally intensive, the idea of using all possible subsets selection with such 
methods on large problems is not computationally feasible. 

4.1. Model Selection By Random Draws 

In this section we study the basic algorithm using simple random sampling. The 
underlying idea is to draw models and observations at random, to fit this model 
by least-squares to these observations and to evaluate the predictive quality of 
the fitted model on the remaining observations. This procedure is summarized in 
Table 1. With $p$ variables, there are $2^p$ potential models and with $N$ drawings, each
variable will be included in about $N/2$ of the drawings, each pair of variables will be
included in about $N/4$ of the drawings, and so on. Each of the models is expected
to be sampled $N/2^p$ times. The effectiveness of such a scheme deteriorates when
p is large, or else N will have to be increased correspondingly. Robustness can be 
achieved easily and simply by choosing a robust measure of the predictive quality. 
In our implementation, we have used the median absolute deviation (MAD) of the 
prediction residuals. 

The crucial question is how to use the result from this algorithm for the final 
variable selection. For the formal description, we need to introduce some notation. 
Table 1. Algorithm for variable selection based on randomly
drawn subsamples and models.

| Step | Action |
|---|---|
| Initialization | Choose values for the number of draws $N > 0$, for the number of best models stored $n_b > 0$ and for the number of observations randomly chosen during the model fitting, $n > n_s > p$. |
| Drawing | Draw a random model by selecting each variable at random with probability 1/2 and choose a random subset of size $n_s$ from the $n$ observations. |
| Evaluate | Compute the least squares fit of the model drawn with the selected data and compute the predictive residuals on the unused data; as a measure of prediction quality compute the MAD of these residuals (predictive MAD). |
| Summarize | Keep score of the $n_b$ models with the lowest predictive MAD. |
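
The steps of Table 1 can be sketched in a few lines. The code below is an illustrative reading of the table, not the authors' implementation: the function name `random_draw_selection`, the default subsample size, and the default numbers of draws and stored models are assumptions, and the predictive MAD is taken to be the median of the absolute prediction residuals on the observations not used for fitting.

```python
import numpy as np

def random_draw_selection(X, y, n_draws=2000, n_best=200, n_sub=None, seed=0):
    """Variable selection by random draws of models and subsamples.

    Returns (counts, best) where best is an (n_best, p) boolean array coding
    the variables of the retained models and counts[j] is the number of
    retained models that include variable j.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if n_sub is None:
        n_sub = (n + p) // 2                        # assumed default, n > n_sub > p
    results = []
    for _ in range(n_draws):
        model = rng.random(p) < 0.5                 # each variable with prob. 1/2
        rows = rng.choice(n, size=n_sub, replace=False)
        held_out = np.setdiff1d(np.arange(n), rows) # observations not sampled
        A = np.column_stack([np.ones(n), X[:, model]])
        beta, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)
        pred_resid = y[held_out] - A[held_out] @ beta
        results.append((np.median(np.abs(pred_resid)), model))
    results.sort(key=lambda item: item[0])          # lowest predictive MAD first
    best = np.array([model for _, model in results[:n_best]])
    return best.sum(axis=0), best
```

The returned counts give, for each variable, how often it appears among the retained models, which is the kind of summary used for the final selection.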



Let $i$ be an index for the best models found ($i = 1, \ldots, n_b$) and let $m(i,j)$ code for
the variables included in the model, in the sense that

$$m(i,j) = \begin{cases} 0, & \text{if variable $j$ is not included in model $i$;} \\ 1, & \text{if variable $j$ is included in model $i$;} \end{cases}$$

($i = 1, \ldots, n_b$; $j = 1, \ldots, p$). Furthermore, let $\mathrm{pMAD}(i)$ be the predictive MAD for
model $i$. We wish to interpret the predictive MAD as an indicator for the predictive
power of a model. However, in doing so, we must keep in mind that this indicator is
highly variable and it would, for example, be naive to expect that the overall best
model, that is the one with the lowest predictive MAD, is a good final selection. Let
$X_1, \ldots, X_m$ be independent and identically distributed normal random variables
with expectation 0 and variance $\sigma^2$ and let $\mathrm{MAD} = \operatorname{med}(|X_1|, \ldots, |X_m|)$. It follows
that the asymptotic expectation is $\tau = 0.674\,\sigma$, whereas the asymptotic standard
deviation is $1.2\,\tau/\sqrt{m}$. For small values of $m$, the relative uncertainty in the
estimation of $\tau$ as measured by plus/minus one standard deviation is huge. The
relative error when $m = 20$, for example, is 50%. For $m = 100$ it still is more than
20% and when $m = 500$ we should expect a 10% error. Depending on the number
of observations used in the evaluation of the predictive residuals, $m = n - n_s$, a
change of the predictive MAD by a factor of two is thus no indication of a real
difference in the underlying predictive power.
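
The percentages quoted above can be checked with a short calculation. The sketch below uses the standard asymptotic variance of a sample median, $1/(4 m f(\tau)^2)$, where $f$ is the density of $|X|$ at its median $\tau$, and it assumes that "relative uncertainty" means the full width of the plus/minus one standard deviation band divided by $\tau$. The function name `mad_relative_uncertainty` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mad_relative_uncertainty(m):
    """Width of the +/- one sd band of med(|X_1|, ..., |X_m|) relative to tau.

    Computed for standard normal X (sigma = 1); the ratio does not depend
    on sigma.
    """
    nd = NormalDist()
    tau = nd.inv_cdf(0.75)                  # median of |X|, about 0.674
    density = 2.0 * nd.pdf(tau)             # density of |X| at tau
    sd = 1.0 / (2.0 * density * sqrt(m))    # asymptotic sd of the sample median
    return 2.0 * sd / tau                   # full band width relative to tau

for m in (20, 100, 500):
    print(m, round(mad_relative_uncertainty(m), 3))
```

Under these assumptions the ratio is roughly 0.52 for $m = 20$, 0.23 for $m = 100$ and 0.10 for $m = 500$, in line with the figures in the text.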

In order to take account of this variability, we keep a number of the best
models rather than a single one. The simplest method for determining the final
selection is based on a summary of the form $m(+,j) = \sum_{i=1}^{n_b} m(i,j)$. The ordering
of $m(+,j)$ can be taken as an indication for the relative importance of the variables.
Under the assumption that none of the variables is needed, $m(+,j)$ has a binomial
distribution with parameters $n_b$ and $1/2$ and, based on the normal approximation
and the Bonferroni inequality, an observed value larger than
$n_b\,\bigl(0.5 + \Phi^{-1}(1 - 0.05/p)/(2\sqrt{n_b})\bigr)$ is indicative of a significant departure.

Modifications of the basic scheme, for example, by periodically nominating 
as outliers the observations contributing above average to the worst models, may 
improve the performance, but we have not pursued such ideas. 



5. Some Comparisons 

We have discussed a variety of methods for the selection of cases and explanatory 
variables in the regression setting. To compare all possible combinations of case 
selection (and/or weighting) with variable selection is beyond the scope of this 
paper. Our benchmark is the experimental set-up in Ronchetti et al. (1997). The
situations considered in that paper are regressions with six variables and four error
distributions, namely Gaussian (e1), 7% wild (e2), slash (e3) and 10% asymmetric
wild (e4). The reader is invited to consult the original paper for details. We consider
model p2, where the last variable $x_6$ is superfluous, and $n = 60$ observations. In
Table 2, we compare our methods (the rows called “draws” and “pen”) with their
bounded influence method (“rob”) and the all possible subsets (LS) method of
Shao (1993). Since bounded influence methods have lower breakdown in higher
dimensions, we have also included “LMSA” which uses least median of squares to
find a “clean” subset of the data with all explanatory variables present. The clean
set is defined by eliminating those observations with residuals greater than the 80th
percentile in absolute value. Variable selection is accomplished by first ordering
all possible subsets (models) using the single row deletion PRESS statistic. Since
the PRESS statistic is subject to variation, we look at the lowest values of PRESS
until the first “big” jump in PRESS values occurs. A big jump is defined as a jump
bigger than 90% of all the jumps. The variables included in models with PRESS
values below the jump are taken as the final selected model.

Table 2. This table shows the number of times that the correct
model variables have been found in 200 simulations for $n = 60$
and model p2 with the methods discussed above. The row labelled
draws shows the number of correct models obtained when using
a cutoff value of 0.51, which is slightly smaller than the value
proposed in the text, which for $n_b = 200$ is equal to 0.53.

| $n = 60$, model p2 | without leverage | | | | with leverage | | | |
|---|---|---|---|---|---|---|---|---|
| | e1 | e2 | e3 | e4 | e1 | e2 | e3 | e4 |
| LS | 188 | 47 | 0 | 3 | 186 | 44 | 0 | 1 |
| rob | 157 | 160 | 10 | 167 | 169 | 166 | 7 | 172 |
| draws | 160 | 141 | 9 | 141 | 170 | 150 | 15 | 173 |
| pen | 164 | 166 | 10 | 141 | 80 | 86 | 45 | 66 |
| LMSA | 139 | 141 | 129 | 141 | 52 | 42 | 60 | 44 |

The penalty method that just solves equation (3.1) and then selects variables 
with $|t| > 2$ (no steps 2 through 4) does very well in the “nice” case (e1, p2, no
leverage) giving Correct 182 and Extra 18. Hence, for clean data, it would appear 
to be a viable (and cheaper) selection tool when compared to all possible subsets 
(LS) which gives the correct model in 188 out of 200. However, this simple penalty 
method does not do well with contaminated data (nor does LS for that matter) 
and the iteration steps 2 to 4 are required. 

6. Conclusions 

For the number of variables and observations considered here, sampling variables 
and observations simultaneously provides good performance in all cases, except 
when the errors follow a slash distribution. It is, of course, a computationally 
intensive method and how it will scale to larger problems remains to be seen. We 
plan to investigate various combinations of sampling and penalty methods to see if 
reasonable performance can be achieved on larger scale problems. In bioinformatics 
and data-mining there are many applications where p exceeds n and n is not too 
large. We are hopeful that the methods discussed in this paper will be particularly 
useful in these areas.

Analyzing the Number of Samples Required for an Approximate Monte-Carlo LMS Line Estimator

Abstract. This paper analyzes the number of samples required for an approximate
Monte-Carlo least median of squares (LMS) line estimator. We provide a
general computational framework, followed by detailed derivations for several
point distributions and subsequent numerical results for the required number
of samples.

Keywords. Least median of squares (LMS) line estimator, Monte-Carlo LMS,

1. Introduction and Problem Definition

We are given a set of $n$ observations in the plane $(x_i, y_i)$, for $i = 1, \ldots, n$, and
consider the linear regression model (with intercept)

$$y_i = (a x_i + b) + r_i,$$

where $a$ and $b$ are the slope and $y$-intercept of the regression line and $r_i$ is the
residual associated with the observation $(x_i, y_i)$. We consider robust regression, in
which the goal is to estimate the line parameters $(a, b)$ that fit the bulk of the data,
even when outliers are present. We assume that an upper bound on the fraction
of outliers $q$, $0 < q < 0.5$, is known, so that the number of outliers is at most $qn$.
We make no assumptions about the distribution of these outliers. The remaining
points, of which there are at least $(1 - q)n$, are called inliers, and are assumed to
lie near the line (see further specifications below).

A popular approach for robust estimation of this type is Rousseeuw’s least
median of squares (LMS) line estimator (Rousseeuw, 1984). The estimator computes
the line of fit that minimizes the median squared residual (or any order
statistic). Define a strip to be the region bounded between two nonvertical, parallel
lines, and define the vertical width of a strip to be the vertical distance between
these lines. LMS is computationally equivalent to determining a strip of minimum
vertical width that contains at least 50% of the points. Thus, the LMS estimation
problem can be reduced to a geometric optimization problem.

A number of algorithms have been derived for the computation of the LMS
estimator. Stromberg (1993) gave an exact algorithm for computing the LMS
hyperstrip in $d$-dimensional space. His algorithm runs in $O(n^4 \log n)$ time for simple
regression. A slightly improved algorithm which runs in $O(n^4)$ time is due to
Agullo (1997). The best algorithm known for finding the LMS strip (in the plane) is
the topological plane-sweep algorithm due to Edelsbrunner and Souvaine (1990).
It runs in $O(n^2)$ time and requires $O(n)$ space. However, even quadratic running
time is unacceptably high for many applications involving large data sets.

In practice, simple Monte-Carlo approximation algorithms are often used.
The feasible solution algorithm due to Hawkins (1993) is one such algorithm.
Additional Monte-Carlo-based algorithms are described in Rousseeuw and Leroy (1987)
and Rousseeuw and Hubert (1997). Some number of pairs of distinct points are
sampled at random. Each pair of points determines the slope of a line. The
optimal intercept is then computed by a process called intercept adjustment, where
the optimal intercept is determined by reducing the simple regression problem to
an LMS location estimation problem. This can be done in $O(n \log n)$ time (see
Rousseeuw and Leroy, 1987). Thus for a fixed number of subsamples, it is possible
to determine the best LMS line approximation in $O(n \log n)$ time. This is equivalent
to finding the narrowest strip containing (at least) half of the points over all
the strips of slopes defined by the subsamples. The midline of the narrowest strip
is the LMS line approximation.
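
The sampling scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `lms_intercept` and `monte_carlo_lms` are hypothetical, "at least half of the points" is taken to mean $\lceil n/2 \rceil$ points, and intercept adjustment is done by sorting the residuals $y_i - s x_i$ and scanning for the narrowest window, which costs $O(n \log n)$ per sampled slope.

```python
import random

def lms_intercept(residuals, h):
    """Narrowest window covering h of the sorted residuals.

    Returns (midpoint, width); the midpoint is the adjusted intercept.
    """
    r = sorted(residuals)
    start = min(range(len(r) - h + 1), key=lambda i: r[i + h - 1] - r[i])
    width = r[start + h - 1] - r[start]
    return r[start] + width / 2.0, width

def monte_carlo_lms(points, num_pairs, rng=None):
    """Approximate LMS line from randomly sampled point pairs.

    points is a list of (x, y) tuples. Returns (slope, intercept, width)
    of the narrowest strip found, or None if no usable pair was drawn.
    """
    rng = rng or random.Random(0)
    h = (len(points) + 1) // 2          # assumption: ceil(n / 2) points
    best = None
    for _ in range(num_pairs):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:                    # vertical line, no finite slope
            continue
        slope = (y2 - y1) / (x2 - x1)
        intercept, width = lms_intercept(
            [y - slope * x for x, y in points], h)
        if best is None or width < best[2]:
            best = (slope, intercept, width)
    return best
```

With a fixed number of sampled pairs the total cost stays $O(n \log n)$, which is the point made in the text; how many pairs are needed is the question the rest of the paper addresses.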

Although inliers are usually assumed to lie exactly on the line, this is rarely
the case in practice. We assume that the distribution of residuals is known and
is independent of the $x$-coordinate. The main issue considered is how many pairs
of points should be sampled to guarantee that, with sufficiently high probability,
this Monte-Carlo estimate returns a sufficiently close approximation to the LMS
line of fit. The meaning of “sufficiently close” is based on the formulation given by
Mount et al. (1997). Let $w^*$ denote the vertical width of the narrowest strip
containing (at least) 50% of the points. Given an approximation error bound, $\epsilon > 0$,
we say that any strip that contains (at least) 50% of the points and is of vertical
width at most $(1 + \epsilon)w^*$ is an $\epsilon$-approximate solution to the LMS problem. Given
a confidence probability $P_c$, our goal is to compute the minimum number of
random samples needed, so that the Monte-Carlo algorithm returns an $\epsilon$-approximate
solution to the LMS problem with probability at least $P_c$. This analysis is based
on some assumptions about the distribution of inliers, which are given in the next
section.

2. Assumptions on the Point Distribution

Suppose that the line in question is given by $y = ax + b$. As previously noted, we
make some concrete assumptions about the distribution of the inliers. Specifically,
we assume that the $x$-coordinates of the points are uniformly distributed over an
interval $[x_0, x_1]$, and that the $y$-coordinate for a point with $x$-coordinate $x$ is
$ax + b + z$, where the deviate $z$ is a random variable having a certain probability density
function (pdf). In Section 4 we consider specifically Gaussian and exponential
pdf’s. (Assuming normally distributed residuals is fairly common. The exponential
pdf was chosen as a case study for a one-sided distribution.) In the full paper we
also consider the uniform pdf (Mount et al., 2003).

Because the approximation bound is based on the ratio of vertical widths of 
two parallel strips, we may apply any affine transformation to the plane that pre- 
serves such ratios. In particular, in order to simplify the analysis, we transform the 
x-coordinates to lie within the interval [0, 1], and we transform the y-coordinates 
by subtracting $ax + b$, so that the line of fit is mapped to the $x$-axis, and finally we
scale the y-coordinates so that the standard deviation of residuals is some desired 
value. After this transformation, the optimal slope is zero. It is easy to verify that 
all these transformations preserve the ratio of vertical widths of parallel strips. 
Our analysis holds in the limit as n tends to infinity. 



3. Sketch of Computational Procedure 

Our analyses for the various residual distributions are all based on a common
approach. Any sampled pair of points defines a line with some slope $s$. Based on
our knowledge of the distributions of the $x$- and $y$-coordinates of the inliers, it will
be possible to derive the pdf of $s$, for pairs of inliers. In later sections we provide
this derivation, but for now, let $f_S(s)$ denote this slope pdf.

Consider a parallel strip whose sides are of slope $s$, whose central line has
$y$-intercept $b$, and whose vertical width is $w$. Let $A_b(s, w)$ denote the probability
mass of this strip relative to the given point distribution. We will omit $b$ when
it is clear from context. The algorithm computes a strip of slope $s$ that contains
at least half of the points. Since we know nothing about outliers, if we want to
guarantee that at least half of all the $n$ points lie within a given strip, the strip
should contain a fraction of at least $1/(2(1 - q))$ of the $(1 - q)n$ inliers. Throughout,
we let $F_I = 1/(2(1 - q))$ denote this fraction of inliers. (Recall that $q < 0.5$.) Define
the slope-width function, $w(s)$, to be the vertical width of the narrowest strip that
has this fraction of the inlier probability mass. That is,

$$w(s) = \arg\min_w \, \exists b \, \big(A_b(s, w) \ge F_I\big). \tag{3.1}$$

For most reasonable residual distributions, we would expect that $w(s)$
increases monotonically relative to its optimum value of $w(0)$, denoted $w^*$. (See
Figure 1.) This can be shown for the distributions presented here, by a
straightforward perturbation argument, which will be presented in the full version of the
paper (Mount et al., 2003).

Figure 1. An illustration of the slope-width function $w(s)$ and
minimum and maximum allowable slopes.

Let $[s_{\min}, s_{\max}]$ denote the largest interval such that,
for all $s \in [s_{\min}, s_{\max}]$, $w(s) \le (1 + \epsilon)w^*$. A slope $s$ is $\epsilon$-good if it lies within this
interval. A sampled point pair is $\epsilon$-good if the line passing through this pair has
an $\epsilon$-good slope. Recalling that $f_S(s)$ is the slope pdf, the probability of randomly
sampling an $\epsilon$-good point pair, denoted $P_g$, is

$$P_g = \int_{s_{\min}}^{s_{\max}} f_S(s)\,ds. \tag{3.2}$$

Finally, given that with probability $P_c$ at least one good sample should be
chosen, it follows that if $N$ is the number of point pairs to be sampled, the
probability of generating no good pair in $N$ trials should be at most $1 - P_c$. Since
we assume that sampling is performed independently, and since the probability of
selecting a pair of inliers is $(1 - q)^2$, a single sample is a good inlier pair with
probability $(1 - q)^2 P_g$, and $N$ should be selected such that

$$\big(1 - (1 - q)^2 P_g\big)^N \le 1 - P_c.$$

Solving for $N$, we obtain the desired bound on the number of samples.

$$N \ge \frac{\log(1 - P_c)}{\log\big(1 - (1 - q)^2 P_g\big)}.$$
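
As a small worked illustration of this bound, the sketch below evaluates it for given values of $P_c$, $q$ and $P_g$. The function name `required_samples` and the argument checks are assumptions made for the example; the formula itself is the one stated above.

```python
import math

def required_samples(p_c, q, p_g):
    """Smallest integer N with (1 - (1 - q)**2 * p_g)**N <= 1 - p_c.

    p_c: confidence probability, q: upper bound on the outlier fraction,
    p_g: probability that a random pair of inliers is epsilon-good.
    """
    if not (0.0 < p_c < 1.0 and 0.0 <= q < 0.5 and 0.0 < p_g <= 1.0):
        raise ValueError("need 0 < p_c < 1, 0 <= q < 0.5, 0 < p_g <= 1")
    p_success = (1.0 - q) ** 2 * p_g    # one sample is a good inlier pair
    if p_success >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - p_c) / math.log(1.0 - p_success))

# With no outliers and p_g = 0.5, a 95% confidence needs 5 pairs,
# since 0.5**5 = 0.03125 <= 0.05 < 0.5**4.
print(required_samples(0.95, 0.0, 0.5))
```

The bound grows quickly as $P_g$ shrinks, which is why the later sections concentrate on computing $P_g$ accurately.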

The remainder of the analysis reduces to computing the slope density function
$f_S(s)$, computing the slope-width function $w(s)$, inverting this function to compute
$s_{\min}$ and $s_{\max}$, and finally solving the integral of Eq. (3.2). In the rest of the paper
we will provide detailed derivations of these results for two common residual pdf’s.



4. Derivation of Slope Distribution 

Let $p_i = (x_i, y_i)$, $p_j = (x_j, y_j)$ denote two distinct inlying points sampled at
random. According to the point distribution model described in Section 2, $x_i$,
$x_j$ are independent, uniformly distributed in the interval $[0, 1]$, that is, $x_i, x_j \sim U(0, 1)$,
and $y_i$, $y_j$ are independent, identically distributed (i.i.d.) according to some
specified pdf. We make the general position assumption that the $x$-coordinates of
the points are distinct. The slope of the line joining $p_i$ and $p_j$ is

$$s = \frac{y_j - y_i}{x_j - x_i}. \tag{4.1}$$



Thus to find $f_S(s)$, it suffices to find the pdf’s of $(y_j - y_i)$ and $(x_j - x_i)$, and
then find the pdf of their quotient. The latter can be found using the following
result from basic probability theory (Fisz, 1963; Papoulis, 1965). Let $S = V/U$
denote the quotient of two random variables $V$ and $U$. The pdf of $S$ is given by
$f_S(s) = \int_{-\infty}^{\infty} f_{VU}(us, u)\,|u|\,du$, where $f_{VU}(\cdot)$ is the joint pdf of $V$ and $U$. Thus if
$V$ and $U$ are independent,

$$f_S(s) = \int_{-\infty}^{\infty} f_V(us)\, f_U(u)\,|u|\,du. \tag{4.2}$$



Let $U$ and $V$ denote, respectively, the random variables $x_j - x_i$ and $y_j - y_i$.
The pdf of $U$ is given by the convolution of the pdf’s corresponding to $U(0, 1)$ and
$U(-1, 0)$. It is easy to show that

$$f_U(u) = \begin{cases} 1 + u & -1 \le u \le 0 \\ 1 - u & 0 < u \le 1 \\ 0 & \text{otherwise.} \end{cases} \tag{4.3}$$



In the following subsections we derive the pdf of V for two common distributions. 
Also, we derive for each case the resulting pdf of the slope, using (4.2). 



4.1. The Gaussian Case 

We assume, without loss of generality, that $y_i, y_j \sim N(0, 1)$. Thus $-y_i \sim N(0, 1)$,
as well. Based on the fact that the sum of (two) independent, normally distributed
random variables has a normal pdf whose expected value is the sum of the expected
values and whose variance is the sum of the variances, we have $y_j - y_i \sim N(0, \sqrt{2})$,
where the second parameter is the standard deviation.
In other words, $f_V(v) = (1/(2\sqrt{\pi})) \exp(-v^2/4)$. Employing (4.2) we thus obtain

$$f_S(s) = \frac{1}{2\sqrt{\pi}} \int_{-1}^{1} e^{-k(s)u^2} f_U(u)\,|u|\,du, \tag{4.4}$$

where $k(s) = s^2/4$. Since in view of (4.3) the integrand is an even function of $u$,
we may rewrite the above equation as:

$$f_S(s) = \frac{1}{\sqrt{\pi}} \int_0^1 e^{-k(s)u^2} (1 - u)\,u\,du. \tag{4.5}$$

Solving the definite integral yields the following bottom-line expression:

$$f_S(s) = \frac{1}{2\sqrt{\pi}\,k(s)} \left(1 - 0.5\sqrt{\frac{\pi}{k(s)}}\;\mathrm{erf}\big(\sqrt{k(s)}\big)\right), \tag{4.6}$$

where $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x e^{-t^2}\,dt$. (See Mount et al., 2003, for a detailed
derivation.) Figure 2(a) gives a plot of $f_S(s)$ for the Gaussian case. Note that although
the denominator of $f_S(s)$ approaches zero as $s \to 0$, the function is well defined.
Specifically, it can be shown that the limit of $f_S(s)$ as $s \to 0$ is equal to $1/(6\sqrt{\pi})$.

4.2. The Exponential Case

We now consider the exponential case. Without loss of generality, we assume that
$y_i, y_j \sim E(1)$, that is, $y_i$, $y_j$ are independent, exponentially distributed with $\lambda = 1$.
(This one-sided distribution is a special case of the gamma and Weibull
distributions.) In other words, $y_j \sim e^{-y}H(y)$ and $-y_i \sim e^{y}H(-y)$, where $H(y)$ denotes a
step function, such that $H(y) = 1$ for $y \ge 0$ and $H(y) = 0$ otherwise. It can be
shown that the convolution of these pdf’s results in a Laplace pdf (see Mount et
al., 2003):

$$y_j - y_i \sim \tfrac{1}{2}\,e^{-|v|}. \tag{4.7}$$

Employing (4.2) we now obtain

$$f_S(s) = \frac{1}{2} \int_{-1}^{1} e^{-|us|} f_U(u)\,|u|\,du. \tag{4.8}$$

As before, since the integrand is an even function of $u$, we may rewrite this as:

$$f_S(s) = \int_0^1 e^{-|s|u} (1 - u)\,u\,du. \tag{4.9}$$

Solving the definite integral yields the desired slope pdf:

$$f_S(s) = \frac{1}{|s|^3} \left((|s| + 2)\,e^{-|s|} + |s| - 2\right). \tag{4.10}$$

(See Mount et al., 2003, for a detailed derivation.) Although the denominator is
equal to zero at $s = 0$, $f_S(s)$ is well-defined at the origin. In particular, it can be
shown (by a Taylor expansion of the numerator up to terms of degree 4) that the
limit of $f_S(s)$ as $s \to 0$ is equal to $1/6$. Figure 2(b) gives a plot of $f_S(s)$ for the
exponential case.
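
The two closed forms can be checked numerically against the quotient formula (4.2). The sketch below is an illustration only, using names of our own choosing (`slope_pdf_numeric`, `gaussian_slope_pdf`, `exponential_slope_pdf`); it approximates the integral in (4.2) with a plain midpoint rule and switches to the stated limits near $s = 0$ to avoid cancellation.

```python
import math

def f_u(u):
    """Triangular pdf of U = x_j - x_i for x_i, x_j ~ U(0, 1), Eq. (4.3)."""
    return max(0.0, 1.0 - abs(u))

def slope_pdf_numeric(f_v, s, steps=20000):
    """Midpoint-rule approximation of Eq. (4.2) over u in [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        u = -1.0 + (i + 0.5) * h
        total += f_v(u * s) * f_u(u) * abs(u)
    return total * h

def gaussian_slope_pdf(s):
    """Closed form (4.6); tends to 1 / (6 sqrt(pi)) as s -> 0."""
    k = s * s / 4.0
    if k < 1e-8:
        return 1.0 / (6.0 * math.sqrt(math.pi))
    bracket = 1.0 - 0.5 * math.sqrt(math.pi / k) * math.erf(math.sqrt(k))
    return bracket / (2.0 * math.sqrt(math.pi) * k)

def exponential_slope_pdf(s):
    """Closed form (4.10); tends to 1/6 as s -> 0."""
    a = abs(s)
    if a < 1e-3:
        return 1.0 / 6.0 - a / 12.0     # leading terms of the Taylor series
    return ((a + 2.0) * math.exp(-a) + a - 2.0) / a ** 3

def normal_diff_pdf(v):
    """pdf of y_j - y_i for standard normal deviates."""
    return math.exp(-v * v / 4.0) / (2.0 * math.sqrt(math.pi))

def laplace_pdf(v):
    """pdf of y_j - y_i for unit exponential deviates, Eq. (4.7)."""
    return 0.5 * math.exp(-abs(v))

for s in (0.0, 0.5, 2.0, -3.0):
    print(s,
          gaussian_slope_pdf(s), slope_pdf_numeric(normal_diff_pdf, s),
          exponential_slope_pdf(s), slope_pdf_numeric(laplace_pdf, s))
```

Each printed row should show the closed form and the quadrature value agreeing to several digits, which is a quick sanity check on (4.6) and (4.10).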




Figure 2. Slope pdf: (a) Gaussian case, (b) exponential case.

5. Derivation of the Slope-Width Function

In this section we compute the slope-width function $w(s)$, defined in Section 3.

5.1. The Gaussian Case 

We consider first the special case $s = 0$ for which $w(s) = w(0) = w^*$. We make the
observation that the desired parallelogram (a horizontal rectangle of width $w^*$ in
this case) must be symmetric with respect to the $x$-axis. By definition of the pdf
of the $y$-coordinates, it is easy to show that the “area” (that is, probability mass)
of any other horizontal rectangle of width $w^*$ will be smaller than $A(0, w^*) =
A_0(0, w^*)$. Thus

$$A(0, w^*) = \int_{-w^*/2}^{w^*/2} \int_0^1 f_Y(y)\,dx\,dy = \frac{1}{\sqrt{2\pi}} \int_{-w^*/2}^{w^*/2} e^{-y^2/2}\,dy, \tag{5.1}$$

that is,

$$F_I = \sqrt{\frac{2}{\pi}} \int_0^{w^*/2} e^{-y^2/2}\,dy. \tag{5.2}$$

Given $F_I$, we can find $w^*$ (the solution to the implicit equation $A(0, w^*) = F_I$),
by using standard numerical analysis techniques (Atkinson, 1978), such as
Gaussian quadrature (to approximate the various definite integrals) and the
iterative bisection method for root-finding of an equation. For $F_I = 0.5$, for example,
we obtain $w^* \approx 1.349$.
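
The root-finding step can be illustrated as follows. This sketch departs from the text in one respect, which should be read as an assumption of the example: instead of Gaussian quadrature it evaluates the integral in (5.2) through the standard library's `math.erf`, since $A(0, w) = \mathrm{erf}(w/(2\sqrt{2}))$ for standard normal residuals. The function names are hypothetical.

```python
import math

def gaussian_mass_horizontal(w):
    """A(0, w): standard normal mass of the horizontal strip |y| <= w / 2."""
    return math.erf(w / (2.0 * math.sqrt(2.0)))

def bisect_increasing(func, target, lo, hi, tol=1e-10):
    """Solve func(x) = target for an increasing func on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_width_gaussian(f_i):
    """w* such that A(0, w*) = f_i, for 0 < f_i < 1."""
    return bisect_increasing(gaussian_mass_horizontal, f_i, 0.0, 20.0)

print(round(optimal_width_gaussian(0.5), 3))   # 1.349, as quoted in the text
```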

We now assume, without loss of generality, that $s > 0$, and consider a
parallelogram of slope $s$ and width $w$. It can be shown that among all such parallelograms,
the one that is symmetric with respect to the line $y = sx - s/2$ has maximum
area. Specifically this implies that $y = sx - s/2$ is also the midline of the
narrowest strip containing half of the data points. (Moving any other strip in the
direction of the above midline increases the probability mass; thus the width of
any other strip is not a local minimum, in the sense that if the probability mass
exceeds, say, 0.5, it is possible to locally shrink that strip so that it contains half
of the data.) Henceforth, let $A(s, w) = A_{-s/2}(s, w)$. The non-vertical sides of the
desired parallelogram are thus given by the line equations: $y_I = sx - (s - w)/2$
and $y_{II} = sx - (s + w)/2$. We distinguish between two cases: (1) $s < w$ and (2)
$s > w$ (see Figures 3(a) and 3(b), respectively). Due to space limitations, we only
discuss here the first case. (See Mount et al., 2003, for further details.)

We decompose the desired parallelogram into the horizontal rectangle whose
sides are $y = \pm(w - s)/2$, and the triangles above and below this rectangle. The
area (i.e., probability mass) of the two triangles is clearly the same. Thus the total
probability mass of the parallelogram is given by

$$A(s, w) = \int_{-(w-s)/2}^{(w-s)/2} f_Y(y)\,dy + 2 \int_{-(w+s)/2}^{-(w-s)/2} x_{II}(y)\, f_Y(y)\,dy, \tag{5.3}$$

where $x_{II}(y) = (1/s)\,y + (1 + w/s)/2$, which yields

$$A(s, w) = \int_{-(w-s)/2}^{(w-s)/2} f_Y(y)\,dy + \frac{2}{s} \int_{-(w+s)/2}^{-(w-s)/2} y\, f_Y(y)\,dy + \left(1 + \frac{w}{s}\right) \int_{-(w+s)/2}^{-(w-s)/2} f_Y(y)\,dy.$$

Figure 3. The low and high slope cases for the Gaussian distribution.

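Solving the second definite integral leads to the following expression for $A(s, w)$:

$$A(s, w) = \mathrm{erf}\!\left(\frac{w - s}{2\sqrt{2}}\right) + \frac{2}{s\sqrt{2\pi}} \left(e^{-(w+s)^2/8} - e^{-(w-s)^2/8}\right) + \frac{1}{2}\left(1 + \frac{w}{s}\right) \left[\mathrm{erf}\!\left(\frac{w + s}{2\sqrt{2}}\right) - \mathrm{erf}\!\left(\frac{w - s}{2\sqrt{2}}\right)\right].$$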

Using standard numerical analysis techniques, as before, we can obtain numerical
solutions for the implicit equation $A(s, w(s)) = F_I$. That is, for a given $F_I$ and $s$,
we can look up $w(s)$ and vice versa.
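
A minimal sketch of this lookup is given below. It does not use the rectangle-plus-triangles decomposition; as a simplifying assumption it computes the mass of the strip centred on the midline $y = sx - s/2$ directly, by averaging the standard normal mass of the vertical slice $[sx - s/2 - w/2,\; sx - s/2 + w/2]$ over $x \in [0, 1]$ with a midpoint rule, and then bisects on $w$. The function names and step counts are illustrative choices.

```python
import math

def normal_cdf(y):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def strip_mass_gaussian(s, w, steps=400):
    """A(s, w): mass of the strip of slope s and vertical width w whose
    midline is y = s*x - s/2, for x ~ U(0, 1) and y ~ N(0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        centre = s * ((i + 0.5) * h - 0.5)      # midline height at this x
        total += normal_cdf(centre + w / 2.0) - normal_cdf(centre - w / 2.0)
    return total * h

def slope_width_gaussian(s, f_i, tol=1e-8):
    """w(s) solving A(s, w) = f_i by bisection (A increases with w)."""
    lo, hi = 0.0, 20.0 + abs(s)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if strip_mass_gaussian(s, mid) < f_i:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for slope in (0.0, 0.5, 1.0):
    print(slope, slope_width_gaussian(slope, 0.5))
```

At $s = 0$ this reproduces $w^* \approx 1.349$ for $F_I = 0.5$, and the width grows with $|s|$, in line with the monotonicity discussed in Section 3.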

Figures 4(a) and 4(b) show plots of the strip width as a function of slope, for the
Gaussian case and the exponential case, respectively, for various values of $F_I$.

Figure 4. Strip width as a function of slope for various $F_I$ values:
(a) Gaussian case, (b) exponential case.

5.2. The Exponential Case

We consider first the special case $s = 0$. Given the one-sided pdf of the
$y$-coordinates, it is obvious that the lower edge of the desired strip (of width $w^*$)
coincides in this case with the $x$-axis. Thus, letting $A(0, w^*)$ denote the probability
mass of this strip, we have

$$A(0, w^*) = \int_0^{w^*} e^{-y}\,dy = 1 - e^{-w^*}, \tag{5.4}$$

that is, for a given $F_I$, we have

$$w^* = \ln\left(\frac{1}{1 - F_I}\right). \tag{5.5}$$

Because of the lack of symmetry, determining the optimal strip for $s \ne 0$ is
much less obvious than the Gaussian case. To do so, we will derive the probability
mass associated with a differential strip of slope $s$, that is, the probability density
of a line of slope $s$ as a function of its $y$-intercept. Consider the family of lines
$y = sx + y_{int}$, where the intercept $y_{int}$ is a random variable. (Assume that $s > 0$. An
analogous analysis to the one below applies to $s < 0$.) To obtain the distribution
of $y_{int}$, that is, $F(z) = P(y_{int} \le z)$, we distinguish between the two cases: (1)
$z \ge 0$ and (2) $-s \le z < 0$. The distribution is 0 for $z < -s$, due to the underlying
exponential pdf of the $y$-coordinates, and is essentially equal to the probability
mass of the strip defined by $y = sx + z$ and $y = sx - s$ (see Figure 5). We have
for $z \ge 0$



Figure 5. Regions for probability mass computation due to (a)
(5.6) and (b) (5.7).



$$F(z) = P(y_{int} \le z) = \int_0^1 \left(\int_0^{sx+z} e^{-y}\,dy\right) dx = \int_0^1 \left(1 - e^{-sx-z}\right) dx,$$

which yields

$$F(z) = P(y_{int} \le z) = 1 + \frac{1}{s}\left(e^{-s-z} - e^{-z}\right). \tag{5.6}$$

For $-s \le z < 0$ we have

$$F(z) = \int_{-z/s}^1 \left(\int_0^{sx+z} e^{-y}\,dy\right) dx = \int_{-z/s}^1 \left(1 - e^{-sx-z}\right) dx,$$

which yields

$$F(z) = 1 + \frac{z}{s} + \frac{1}{s}\left(e^{-s-z} - 1\right). \tag{5.7}$$

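Differentiating the above distribution with respect to $z$ provides the probability
density of a line having a slope $s$ and $y$-intercept $z$. We obtain

$$f_Z(z) = \begin{cases} \frac{1}{s}\left(1 - e^{-s-z}\right) & -s \le z < 0 \\ \frac{1}{s}\left(e^{-z} - e^{-s-z}\right) & z \ge 0. \end{cases} \tag{5.8}$$

It can easily be seen that the above pdf is unimodal. The mode occurs at $z = 0$
and its value is $f_Z(0) = m$, where $m = (1/s)(1 - e^{-s})$.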

Having established (for a given slope $s > 0$) the pdf of a line’s intercept and
its unimodality, we make the observation that the width of the desired strip is
given by $w(s) = y_1 - y_0$, for some $y_0 < 0$, $y_1 > 0$, such that:

(1) $f_Z(y_0) = f_Z(y_1)$, and

(2) $F(y_1) - F(y_0) = A(s, w(s)) = F_I$.

In this context, it is understood that $A(s, w(s))$ applies to the strip bounded by the
$y$-intercepts $y_0$ and $y_1$. The first observation arises as a local-minimality constraint
on the strip lying between the intercepts $[y_0, y_1]$. (If this did not hold, a differential
argument shows that an infinitesimal shift of the strip, either up or down, would
increase the probability mass without increasing the strip’s width, contradicting
the strip’s minimality.) The second fact simply states that the strip must contain
a fraction $F_I$ of the probability mass. From (5.8) and the first constraint we obtain

$$1 - e^{-y_0 - s} = \left(1 - e^{-s}\right) e^{-y_1} \tag{5.9}$$

and substituting $(y_0 + w)$ for $y_1$ yields

$$e^{-y_0}\left(e^{-s} + e^{-w} - e^{-s-w}\right) = 1. \tag{5.10}$$

From (5.6), (5.7), we have

$$F(y_0) = 1 - \frac{1}{s} + \frac{y_0}{s} + \frac{1}{s}\,e^{-y_0 - s}$$

and

$$F(y_1) = 1 + \frac{1}{s}\left(e^{-s} - 1\right) e^{-y_1}.$$

From the last two equations and the second constraint we obtain

$$A(s, w(s)) = \frac{1}{s}\left(\left(e^{-s} - 1\right) e^{-y_1} + 1 - y_0 - e^{-y_0 - s}\right) = F_I.$$

Combining the latter equation with (5.9) and (5.10) yields

$$A(s, w(s))\,s = -y_0,$$

or

$$A(s, w(s))\,s = -\ln\left(e^{-s} + e^{-w} - e^{-s-w}\right) = F_I\,s.$$

Hence we get the following closed-form expression

$$w(s) = \ln\left(\frac{1 - e^{-s}}{e^{-F_I s} - e^{-s}}\right).$$

Finding $s$ from a given $w$ does not yield a closed-form solution. As in the previous
subsection, a standard iterative root-finding technique was used.
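
The inversion step for the exponential case can be sketched as follows. The sketch rests on two assumptions of ours: it uses the closed form obtained by solving $-\ln(e^{-s} + e^{-w} - e^{-s-w}) = F_I s$ for $w$, namely $w(s) = \ln\big((1 - e^{-s})/(e^{-F_I s} - e^{-s})\big)$, and it treats negative slopes through $|s|$ on the grounds that the text states an analogous analysis applies for $s < 0$. The function names are hypothetical.

```python
import math

def slope_width_exponential(s, f_i):
    """w(s) for unit exponential residuals, 0 < f_i < 1.

    Written with expm1 so that small slopes do not suffer cancellation:
    e^{-F s} - e^{-s} = e^{-s} * (e^{(1 - F) s} - 1).
    """
    s = abs(s)
    if s == 0.0:
        return math.log(1.0 / (1.0 - f_i))          # Eq. (5.5)
    return (math.log(-math.expm1(-s)) + s
            - math.log(math.expm1((1.0 - f_i) * s)))

def max_good_slope(eps, f_i, tol=1e-10):
    """Largest s with w(s) <= (1 + eps) * w*, found by bisection."""
    target = (1.0 + eps) * slope_width_exponential(0.0, f_i)
    lo, hi = 0.0, 1.0
    while slope_width_exponential(hi, f_i) < target:
        hi *= 2.0                                   # grow the bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope_width_exponential(mid, f_i) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_max = max_good_slope(0.1, 0.5)
print(s_max, slope_width_exponential(s_max, 0.5))
```

Under the symmetry assumption, $s_{\min} = -s_{\max}$, so $P_g$ in (3.2) is twice the integral of the slope pdf (4.10) from 0 to $s_{\max}$.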

6. Numerical Results 

We show the concrete results of the analysis for the Gaussian and exponential
distributions. Figure 6 shows the probability that a single randomly sampled pair is
good as a function of $\epsilon$. Recall that a sampled pair is good if there is a strip
whose slope is determined by the pair, that contains a fraction of $F_I$ of the inliers,
and whose width is at most $(1 + \epsilon)$ times the optimum width. Recall that $F_I =
1/(2(1 - q))$, where $q$ is the expected fraction of outliers, and hence the values
$F_I = \{0.5, 0.7, 0.9\}$ shown in the plots correspond to outlier fractions of $q =
\{0, 0.29, 0.44\}$, respectively. Figure 7 shows the required number of samples as a
function of $\epsilon$ for various confidence levels $P_c$, and for the fixed value $F_I = 0.5$.



Figure 6. Probability that a single random sample is good as a
function of $\epsilon$ for various $F_I$ values: (a) Gaussian case, (b) exponential case.

Visualizing 1D Regression

D.J. Olive

Abstract. Regression is the study of the conditional distribution of the
response $y$ given the predictors $x$. In a 1D regression, $y$ is independent of $x$
given a single linear combination $\beta^T x$ of the predictors. Special cases of 1D
regression include multiple linear regression, binary regression and generalized
linear models. If a good estimate $\hat{b}$ of some non-zero multiple $c\beta$ of $\beta$ can
be constructed, then the 1D regression can be visualized with a scatterplot of
$\hat{b}^T x$ versus $y$. A resistant method for estimating $c\beta$ is presented along with
applications.

Mathematics Subject Classification (2000). Primary 62G35; Secondary 62F35. 
Keywords. Diagnostics, generalized linear models, minimum covariance deter- 
minant, response transformations, single index models. 



1. Introduction 

Regression is the study of the conditional distribution $y|x$ of the response $y$ given
the $(p - 1) \times 1$ vector of nontrivial predictors $x$. In a 1D regression (or regression
with 1-dimensional structure), $y$ is conditionally independent of $x$ given a single
linear combination $\beta^T x$ of the predictors, written

$$y \perp\!\!\!\perp x \mid \beta^T x. \tag{1.1}$$

A 1D regression model has the form

$$y = g(\alpha + \beta^T x, e) \tag{1.2}$$

where $g$ is a bivariate (inverse link) function and $e$ is a zero mean error that is
independent of $x$. See Li and Duan (1989) and Cook and Weisberg (1999a, p. 414).

The above class of models is very rich. A single index model uses

$$y = g(\alpha + \beta^T x, e) = m(\alpha + \beta^T x) + e, \tag{1.3}$$

and the multiple linear regression model is an important special case where $m$ is
the identity function: $m(\alpha + \beta^T x) = \alpha + \beta^T x$. Another important special case of
1D regression is the response transformation model where

$$g(\alpha + \beta^T x, e) = t^{-1}(\alpha + \beta^T x + e) \tag{1.4}$$

and $t^{-1}$ is a one to one (typically monotone) function so that $t(y) = \alpha + \beta^T x + e$.
Generalized linear models (GLM’s) are also a special case of 1D regression.

This research was supported by NSF grant DMS 0202922.

Some notation from the regression graphics literature will be useful.
Dimension reduction can greatly simplify our understanding of the conditional
distribution $y|x$. If a 1D regression model is appropriate, then the $(p - 1)$-dimensional
vector $x$ can be replaced by the 1-dimensional scalar $\beta^T x$ with no loss of
information. A sufficient summary plot (SSP) is a plot that contains all the sample
regression information about the conditional distribution of $y|x$. For 1D regression,
if $y \perp\!\!\!\perp x \mid \beta^T x$ then $y \perp\!\!\!\perp x \mid c\beta^T x$ for any constant $c \ne 0$. The quantity $c\beta^T x$ is called
a sufficient predictor (SP), and a plot of the SP versus $y$ is a SSP. If a consistent
estimator $\hat{b}$ of $c\beta$ can be found for some nonzero $c$, then an estimated sufficient
summary plot (ESSP) is a plot of the estimated sufficient predictor (ESP) $\hat{b}^T x$
versus $y$.

Additional notation is needed before giving theoretical results. Let $x$, $a$, $t$,
and $\beta$ be $(p - 1) \times 1$ vectors where only $x$ is random. The predictors $x$ satisfy
the condition of linearly related predictors with 1D structure (Cook and Weisberg,
1999a, p. 431) if

$$E[x \mid \beta^T x] = a + t\,\beta^T x. \tag{1.5}$$

Notice that $\beta$ is a fixed $(p - 1) \times 1$ vector. If $x$ is elliptically contoured (EC) with
1st moments, then the assumption of linearly related predictors holds. See Cook
(1998a, p. 130).

Following Cook (1998a, pp. 143-144), assume that there is an objective function

$$L_n(a, b) = \frac{1}{n} \sum_{i=1}^{n} L(a + b^T x_i, y_i) \tag{1.6}$$

where $L(u, v)$ is a bivariate function that is a convex function of the first argument
$u$. Assume that the estimate $(\hat{a}, \hat{b})$ of $(a, b)$ satisfies

$$(\hat{a}, \hat{b}) = \arg\min_{a, b} L_n(a, b). \tag{1.7}$$

For example, the ordinary least squares (OLS) estimator uses

$$L(a + b^T x, y) = (y - a - b^T x)^2.$$

Maximum likelihood type estimators such as those used to compute GLM’s and
Huber’s M-estimator also work, as does the Wilcoxon rank estimator. Assume that
the population analog $(\alpha, \eta)$ is the unique minimizer of $E[L(a + b^T x, y)]$ where
the expectation exists and is with respect to the joint distribution of $(y, x^T)^T$. For
example, $(\alpha, \eta)$ is unique if $L(u, v)$ is strictly convex in its first argument. The
following result is a useful extension of Brillinger (1977, 1983).

Figure 1. SSP for $m(u) = u^3$ (plot title: Sufficient Summary Plot for Gaussian Predictors).

Theorem 1. (Li and Duan, 1989, p. 1016): Assume that the $x$ are linearly related
predictors, that $(y_i, x_i^T)^T$ are iid observations from some joint distribution and
that $\mathrm{Cov}(x_i)$ exists and is positive definite. Assume $L(u, v)$ is convex in its first
argument and that $\eta$ is unique. Assume that $y \perp\!\!\!\perp x \mid \beta^T x$. Then $\eta = c\beta$ for some
scalar $c$.

Remark 1. If $\hat{b}$ is a consistent estimator of $\eta$, then certainly

$$\eta = c\beta + u_g$$

where $u_g = \eta - c\beta$ is the bias vector. If the conditions of Theorem 1 hold,
then $u_g = 0$. Under additional conditions, $(\hat{a}, \hat{b})$ is asymptotically normal (see
Li and Duan, 1989, p. 1031). In particular, the OLS estimator frequently has
a $\sqrt{n}$ convergence rate. Often if no strong nonlinearities are present among the
predictors, the bias vector is small enough so that $\hat{b} \approx c\beta$. See Cook and Weisberg
(1999a, p. 431-441) for checking whether the predictors are linearly related and
whether a 1D regression model is appropriate.

A very useful result is that if $y = m(x)$ for some function $m$, then $m$ can
be visualized with both a plot of $x$ versus $y$ and a plot of $cx$ versus $y$ if $c \ne 0$.
In fact, there are only three possibilities: if $c > 0$, then the two plots are nearly
identical. If $c < 0$, then the plot appears to be flipped about the vertical axis. If
$c = 0$, then the plot is a dot plot. Similarly, if $y_i = g(\alpha + \beta^T x_i, e_i)$, then the plot
of $\beta^T x$ versus $y$ and the plot of $c\beta^T x$ versus $y$ will be nearly identical in overall
shape if $c > 0$.

Figure 2. Another SSP for $m(u) = u^3$ (plot title: The SSP using -SP).

Example 1. Suppose that $x_i \sim N_3(0, I_3)$ where $I_3$ is the $3 \times 3$ identity matrix,
and

$$y = m(\beta^T x) + e = (x_1 + 2x_2 + 3x_3)^3 + e \quad \text{with } n = 100.$$

Then a 1D regression model holds with $\beta = (1, 2, 3)^T$. Figure 1 shows the sufficient
summary plot of $\beta^T x_i$ versus $y_i$, and Figure 2 shows the sufficient summary plot
of $-\beta^T x_i$ versus $y_i$. Notice that the functional form $m$ appears to be cubic in both
plots and that both plots can be smoothed by eye or with a scatterplot smoother.
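
A small simulation in the spirit of Example 1 illustrates why an OLS fit can recover the direction of $\beta$ even though the model is far from linear. The sketch below is not the paper's code: the error distribution ($e \sim N(0, 1)$), the random seed, and the helper names `solve_linear` and `ols` are all assumptions made for the illustration, and the normal equations are solved with a plain Gaussian elimination to keep the example dependency-free.

```python
import math
import random

def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def ols(xs, ys):
    """Return (intercept, slope vector) from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve_linear(gram, rhs)
    return coef[0], coef[1:]

rng = random.Random(0)                      # assumed seed, for repeatability
n, beta = 100, (1.0, 2.0, 3.0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
ys = [sum(b * v for b, v in zip(beta, x)) ** 3 + rng.gauss(0.0, 1.0) for x in xs]

a_hat, b_hat = ols(xs, ys)
dot = sum(bh * b for bh, b in zip(b_hat, beta))
cosine = dot / (math.hypot(*b_hat) * math.hypot(*beta))
print("OLS slopes:", b_hat)
print("cosine of angle between OLS slopes and beta:", cosine)
```

For Gaussian predictors the population OLS slope is $E[3(\beta^T x)^2]\,\beta = 42\beta$, so the fitted slopes are expected to be roughly proportional to $(1, 2, 3)^T$ with a constant near 42, up to sampling noise.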
Remark 2. The OLS estimator $(\hat{a}_o, \hat{b}_o)$ is obtained from the usual multiple linear
regression of $y_i$ on $x_i$, but we are not assuming that the multiple linear regression
model holds; however, we are hoping that the 1D regression model $y \perp\!\!\!\perp x \mid \beta^T x$ is a
useful approximation to the data and that $\hat{b}_o \approx c\beta$ for some nonzero constant $c$.
Theorem 1 provides some conditions for the above approximation to hold. Notice
that if the multiple linear regression model does hold and if the errors are such
that OLS is a consistent estimator, then $c = 1$, and $u_g = 0$.

The following result, perhaps first noted by Brillinger (1977, 1983), is called
the 1D Estimation Result by Cook and Weisberg (1999a, p. 432): let $(\hat{a}_o, \hat{b}_o)$ denote
the OLS estimate obtained from the OLS multiple linear regression of $y$ on $x$. The
OLS view is a plot of $\hat{b}_o^T x$ versus $y$. If the 1D regression model is appropriate and if
no strong nonlinearities are present among the predictors, then the OLS view will
frequently be a useful estimated sufficient summary plot. Hence the OLS predictor
$\hat{b}_o^T x$ is a useful ESP.

Three additional methods considered in this paper that have proven useful for
estimating the ESP are sliced inverse regression (SIR), principal Hessian directions
(PHD), and sliced average variance estimation (SAVE). See Cook and Li (2002) for
a discussion of when these methods can fail. These methods frequently perform
well if there are no strong nonlinearities present in the predictors. All of these
methods fail if $c = 0$ or if the bias vector $u_g$ is “large” compared to $c\beta$. For
example, the OLS view can fail if the sufficient summary plot of $\beta^T x$ versus $y$ is
approximately symmetric, and all of these methods can perform poorly if outliers
are present. Gather et al. (2002) show that a single outlier can cause SIR to fail.
Some useful references for SIR and related methods include Cook (1998ab, 2000),
Cook and Weisberg (1999ab), Fung et al. (2002), Li (1991) and Stoker (1986).

Ellipsoidal trimming is a method for estimating the ESP that can reduce the
bias $u_g$. See Cook and Nachtsheim (1994) and Cook (1998a, p. 152). To perform
ellipsoidal trimming, an estimator $(T, C)$ is computed where $T$ is a $(p - 1) \times 1$
multivariate location estimator and $C$ is a $(p - 1) \times (p - 1)$ symmetric positive
definite dispersion estimator. Then the $i$th squared Mahalanobis distance is the
scalar

$$D_i^2 = (x_i - T)^T C^{-1} (x_i - T) \tag{1.8}$$

for each vector of observed predictors $x_i$. If the ordered distance $D_{(j)}$ is unique,
then $j$ of the $x_i$’s are in the ellipsoid

$$\{x : (x - T)^T C^{-1} (x - T) \le D_{(j)}^2\}. \tag{1.9}$$

The $i$th case $(y_i, x_i^T)^T$ is trimmed if $D_i > D_{(j)}$. Then an estimator of $c\beta$ is
computed from the untrimmed cases. For example, if $j \approx 0.9n$, then about 10% of the
cases are trimmed, and OLS could be used on the remaining cases.

The following procedure was suggested by Olive (2002). First compute (T, C)
using the Splus function cov.mcd (see Rousseeuw and Van Driessen, 1999). Trim
the K% of the cases with the largest Mahalanobis distances, and then compute
the OLS estimator $(\hat{\alpha}_K, \hat{\beta}_K)$ from the untrimmed cases. Use K = 0, 10, 20, 30,
40, 50, 60, 70, 80, and 90 to generate ten plots of $\hat{\beta}_K^T x$ versus y using all n cases.
These plots will be called “OLS trimmed views.” Notice that K = 0 corresponds
to the OLS view. The best OLS trimmed view is the trimmed view with a smooth
mean function and the smallest variance function and is the estimated sufficient
summary plot. If K* = E is the percentage of cases trimmed that corresponds to
the best trimmed view, then $\hat{\beta}_E^T x$ is the estimated sufficient predictor.
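
A minimal sketch of the ten OLS trimmed views follows. It is not Olive's Splus code: the helper names are hypothetical, the classical mean and covariance stand in for cov.mcd, and the simulated data only mimic the flavour of a strongly nonlinear single index model. Choosing the best view remains a visual task, so the code only returns the ten slope vectors.

```python
import numpy as np

def mahalanobis_sq(x, T, C):
    """Squared Mahalanobis distances of the rows of x from T with dispersion C."""
    diff = x - T
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)

def ols_trimmed_coefs(x, y, percents=range(0, 100, 10)):
    """Map each trimming percentage K to the OLS slopes from the untrimmed cases."""
    T, C = x.mean(axis=0), np.cov(x, rowvar=False)  # classical stand-in for cov.mcd
    order = np.argsort(mahalanobis_sq(x, T, C))     # cases sorted by distance
    n = len(y)
    out = {}
    for K in percents:
        keep = order[: n - int(round(n * K / 100.0))]
        design = np.column_stack([np.ones(len(keep)), x[keep]])
        coef, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
        out[K] = coef[1:]   # slopes beta_hat_K; coef[0] is alpha_hat_K
    return out

# Illustrative data (an assumption, not the paper's data set).
rng = np.random.default_rng(1)
x = rng.lognormal(size=(200, 3))
sp = x @ np.array([1.0, 2.0, 3.0])   # true sufficient predictor
y = np.sin(sp) / sp
coefs = ols_trimmed_coefs(x, y)
# Each view plots x @ coefs[K] (all n cases) against y; as a numeric side check,
# the correlation of each estimated sufficient predictor with the true one:
esp_corr = {K: abs(np.corrcoef(x @ b, sp)[0, 1]) for K, b in coefs.items()}
```

In practice each of the ten plots would be drawn and inspected; `esp_corr` is available here only because the simulated $\beta$ is known.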

Example 2. For the data in Example 1, the OLS view is similar to Figure 1 except
the plot is not quite as smooth and the horizontal scale is multiplied by $c \approx 42$. The
best trimmed view appears to be identical to Figure 1 except that the horizontal
scale is multiplied by 12.5. The OLS view used $\hat{\beta}_0 = (41.68, 87.40, 120.83)^T \approx 42\beta$
while the best trimmed view used $\hat{\beta}_E = (12.61, 25.07, 37.26)^T \approx 12.5\beta$.

Figure 3. Plots for HBK Data. Panels: Forward Response Plot and Residual Plot (axis label: FIT).

This section has reviewed the existing literature on 1D regression. Section 2
shows that 1D regression can provide useful diagnostics when g is known. Section
3 considers estimating g when $g = g_\lambda$ and $\lambda \in \{\lambda_1, \ldots, \lambda_k\}$ where k is a small
integer. Section 4 suggests using ellipsoidal trimming with methods other than
OLS. This technique gives a resistant version of SIR and shows that the Splus
function lmsreg can be very useful for finding certain types of curvature.



2. 1D Regression Diagnostics

In this section, we suggest that when g is known, an estimated sufficient summary
plot should be used in addition to the usual diagnostics for checking the model.
Assume that the 1D regression model is $y_i = g(\alpha + \beta^T x_i, e_i)$ for i = 1, . . . , n where
the $e_i$ are iid with zero mean and variance $V(e_i) = \sigma^2$.

Remark 3. If $y \perp\!\!\!\perp x \mid \beta^T x$ then $y \perp\!\!\!\perp x \mid a + c\beta^T x$ for any constants a and $c \ne 0$. Hence
if $b^T x$ is an ESP, so is $a + b^T x$.

A good example is the multiple linear regression (MLR) model $y_i = \alpha + \beta^T x_i + e_i$.
Let (a, b) be an MLR estimator of $(\alpha, \beta)$. Then the fitted values are
$\hat{y}_i = a + b^T x_i$ and the residuals are $r_i = y_i - \hat{y}_i$. The most used residual plot is a
plot of $\hat{y}_i$ versus $r_i$, and the forward response plot is a plot of the fitted values $\hat{y}_i$
versus the response $y_i$.

Remark 3 shows that the forward response plot is an ESSP. Let the scalars
$w_i = \alpha + \beta^T x_i$. Ignoring the errors gives $y_i = w_i$, which is the equation of the
identity line that has unit slope and zero intercept. Hence if the MLR model is
appropriate and if (a, b) is a good estimator of $(\alpha, \beta)$, then the plotted points
in the forward response plot should scatter about the identity line. The vertical
deviations from the identity line are the residuals $r_i$ since these deviations are
$y_i - \hat{y}_i$. When the OLS estimator is used, the coefficient of determination $R^2$ is
equal to the squared correlation of $\hat{y}_i$ and $y_i$. See Chambers et al. (1983, p. 280).
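
The identity between $R^2$ and the squared correlation of fitted values and response is easy to check numerically. The sketch below uses simulated data (an assumption made only for illustration) and plain NumPy least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 75
x = rng.normal(size=(n, 3))
y = 1.0 + x @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# OLS fit with an intercept column.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
resid = y - fitted   # vertical deviations from the identity line

# Coefficient of determination and squared correlation of fitted and response.
r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
corr2 = np.corrcoef(fitted, y)[0, 1] ** 2
```

A scatterplot of `fitted` against `y` is the forward response plot, and `fitted` against `resid` is the residual plot.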

High leverage outliers challenge conventional numerical MLR diagnostics such 
as Cook’s distance (Cook, 1977), but, as shown in Example 3 below, can often be 
detected using the forward response and residual plots. Using trimmed views (see 
section 4) is also effective for detecting outliers and other departures from the 
MLR model. 

Example 3. In the well known artificial HBK data set (Hawkins et al., 1984), the 
first 10 cases are outliers while cases 11-14 are good leverage points. This data 
set has n = 75 cases and p — 1 = 3 nontrivial predictors. Figure 3 shows the 
residual and forward response plots based on the OLS estimator. The highlighted 
cases have Cook’s distance > min(0.5, 2p/n), and the identity line is shown in the 
ESSP. 

Now suppose that model (1.2) holds where g is known. If the estimator (a, b)
satisfies $b \approx c\beta$, then the plot of $a + b^T x$ or $b^T x$ versus y can be used to visualize
g provided that $c \ne 0$. Since g is known, the classical (e.g., maximum likelihood)
estimator for $\beta$ should be used since then c = 1 and the bias vector should be
small for large sample size n. Often adding a parametric fit and a lowess smooth
to the plot will be useful.

Plots are also useful for additive error models

$$y_i = m(x_{i,1}, \ldots, x_{i,p-1}) + e_i = m(x_i^T) + e_i = m_i + e_i.$$

Many anova, categorical, nonlinear regression, nonparametric regression, and
multi-index models have this form. For these models $\hat{y}_i = \hat{m}_i$ and the
residuals $r_i = y_i - \hat{y}_i$. In the fit-response plot (FY plot) of $\hat{y}_i$ versus $y_i$, the plotted
points should scatter about the identity line, and the vertical deviations from the
identity line are equal to the residuals.



3. 1D Regression Model Selection

In this section, we assume that a 1D regression model holds with $g = g_{\lambda_o}$ where
$\lambda_o \in \Lambda = \{\lambda_1, \ldots, \lambda_k\}$ and k > 1 is a small integer. To estimate $\lambda_o$, make an
ESSP for each of the k possible models. Examples include choosing a frequentist
or a Bayesian model; a proportional hazards model or one of several competing
1D survival models; a logistic, probit or complementary log-log model in binary
regression; a full or sub model in variable selection.

A good example for illustration is the response transformation model

$$t_{\lambda_o}(y_i) \equiv y_i^{(\lambda_o)} = \alpha_o + \beta_o^T x_i + e_i \qquad (3.1)$$

where the response variable $y_i > 0$ and the power transformation family

$$t_\lambda(y) \equiv y^{(\lambda)} = \frac{y^\lambda - 1}{\lambda} \qquad (3.2)$$

for $\lambda \ne 0$ and $y^{(0)} = \log(y)$. Assume $\lambda_o \in \Lambda = \{0, \pm 1/4, \pm 1/3, \pm 1/2, \pm 2/3, \pm 1\}$.

Figure 4. OLS and LMSREG Suggest Using log(y) for the Textile Data. Panels: a) OLS, LAMBDA=1; b) LMSREG, LAMBDA=1.

The literature for estimating $\lambda_o$ is enormous, and at least two papers using
results from 1D regression have appeared. Let the OLS estimator $(a_o, b_o)$ be
computed from the multiple linear regression of $y_i$ on $x_i$. Then Cook and Weisberg
(1994) suggest that the inverse response plot of y versus $\hat{y}_{OLS}$ will often show
$t_{\lambda_o}$. Hence the forward response plot of $\hat{y}_{OLS}$ versus y will show $t_{\lambda_o}^{-1}$. If $t_\lambda$ is the
appropriate transformation, Cook and Olive (2001) suggest that a plot of $\hat{y}_{OLS}$
versus $t_\lambda(y)$ will follow the identity line.

These ideas suggest a graphical method for selecting response transformations
that can be used with any good MLR estimator. Let $w_i = t_\lambda(y_i)$ for $\lambda \ne 1$, and
let $w_i = y_i$ if $\lambda = 1$. Next, perform the multiple linear regression of $w_i$ on $x_i$ and
make the forward response plot of $\hat{w}_i$ versus $w_i$. If the plotted points follow the
identity line, then take $\hat{\lambda}_o = \lambda$. One plot is made for each of the eleven values of
$\lambda \in \Lambda$, and if more than one value of $\lambda$ works, contact subject matter experts and
use the simplest or most reasonable transformation. (Note that this procedure can
be modified to create a graphical diagnostic for a numerical estimator by adding
the estimate $\hat{\lambda}$ of $\lambda_o$ to $\Lambda$.)
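
The loop over the eleven candidate values of $\lambda$ can be sketched as follows. The function names are hypothetical and the data are simulated so that the log transformation is the true one. The method in the text is graphical; the squared correlation between $\hat{w}_i$ and $w_i$ computed here is only a crude numerical companion to the eleven plots, not a replacement for looking at them.

```python
import numpy as np

LAMBDAS = [-1, -2/3, -1/2, -1/3, -1/4, 0, 1/4, 1/3, 1/2, 2/3, 1]

def power_transform(y, lam):
    """t_lambda(y) from (3.2); log(y) when lambda = 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def forward_response_pairs(x, y, lam):
    """Return (w_hat, w) for the OLS regression of w on x, with w = y when lambda = 1."""
    w = np.asarray(y, dtype=float) if lam == 1 else power_transform(y, lam)
    design = np.column_stack([np.ones(len(w)), x])
    coef, *_ = np.linalg.lstsq(design, w, rcond=None)
    return design @ coef, w

# Simulated positive response for which log(y) is linear in x (an assumption).
rng = np.random.default_rng(3)
x = rng.normal(size=(500, 3))
y = np.exp(1.0 + x @ np.array([1.0, 0.5, -0.5]) + 0.05 * rng.normal(size=500))

scores = {}
for lam in LAMBDAS:
    w_hat, w = forward_response_pairs(x, y, lam)   # plot w_hat versus w here
    scores[lam] = np.corrcoef(w_hat, w)[0, 1] ** 2
best = max(scores, key=scores.get)
```

Any good MLR estimator could replace the least squares call, which is the point of using both OLS and a resistant fit in the example that follows.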

Figure 5. Estimated Sufficient Summary Plots. Panels: a) SIR; b) PHD.



Example 4. A textile data set is given in Box and Cox (1964) where samples of
worsted yarn with different levels of the three factors were given a cyclic load
until the sample failed. The goal was to understand how y = the number of cycles
to failure was related to the predictor variables length, amplitude and load.
Figure 4 shows the forward response plots for two MLR estimators: OLS and the
Splus function lmsreg. Figures 4a and 4b show that a response transformation
is needed while 4c and 4d both suggest that log(y) is the appropriate response
transformation. Using OLS and a resistant estimator as in Figure 4 may be very
useful if outliers are present.



4. Improving 1D Estimators

Assume that the 1D regression model (1.2) holds but both g and $\beta$ are unknown. If
a good estimator $b \approx c\beta$ can be found where $c \ne 0$, then the ESSP can be used to
visualize g. The next step might be to fit a tentative parametric or nonparametric
model in the ESP.

Since methods for estimating the sufficient predictor such as OLS, PHD, 
SAVE and SIR can fail if strong nonlinearities such as outliers are present in the 
predictors or if 6 0 /3, techniques for improving these methods are needed. The 

basic tool will be to use (b, T, C) trimmed views where b is an estimator of $c\beta$ and
(T, C) is an estimator of multivariate location and dispersion. Two good choices
for (T, C) are the classical estimator (Cook 1998a, p. 152) or a robust estimator
such as cov.mcd. Then, for example, SIR trimmed views generalize the SIR ESSP
in the same way that OLS trimmed views generalize the OLS view: use ellipsoidal
trimming to delete the K% of the cases with the largest Mahalanobis distances.
Next, plot $b_K^T x_i$ versus $y_i$ where $b_K$ is the first SIR direction computed from the
untrimmed cases. Again use 10 values of K where K = 0 corresponds to the usual
SIR ESSP, and the best SIR trimmed view is the trimmed view with a smooth
mean function and the smallest variance function.

Figure 6. The Weighted lmsreg Fitted Values vs. Y. Panel title: LMSREG TRIMMED VIEW.

Using trimmed views seems to work for several reasons. The ellipsoidal
trimming divides the data into two groups: the trimmed cases and the untrimmed
cases. First, trimming often removes strong nonlinearities from the predictors, and the
untrimmed predictor distribution is often more nearly elliptically contoured than
the predictor distribution of the entire data set (recall Winsor’s principle: “data
are roughly Gaussian in the middle”). Secondly, under heavy trimming, the mean
function of the untrimmed cases may be more linear than the mean function of the
entire data set. Thirdly, if |c| is very large, then the bias vector $u_g$ may be small
relative to $c\beta$. From Remarks 1 and 2, any of these three reasons could produce a
better estimated sufficient predictor. Also notice that trimmed views are resistant
to y-outliers since the y values are plotted, and trimmed views are resistant to
x-outliers if (T, C) is a resistant estimator.

Example 5. To illustrate the above discussion, an artificial data set with 200
trivariate vectors $x_i$ was generated. The marginal distributions of $x_{i,j}$ are iid lognormal
for j = 1, 2, and 3. Since the response $y_i = \sin(\beta^T x_i) / \beta^T x_i$ where $\beta = (1, 2, 3)^T$,
the random vector $x_i$ is not elliptically contoured and the function g is strongly
nonlinear. The cov.mcd estimator was used for trimming, and Figure 5 shows
the estimated sufficient summary plots for SIR, PHD, SAVE (using 8 slices), and
the OLS 90% trimmed view. The OLS trimmed view is the best, while SAVE
completely fails. Figure 6 shows that for this data, the lmsreg trimmed view is
very useful for visualizing g. The lmsreg estimator attempts to make the median
squared MLR residual small and implements the PROGRESS algorithm described
in Rousseeuw and Leroy (1987, pp. 197-204). Table 1 shows the estimated
sufficient predictor coefficients b when the sufficient predictor coefficients are $c(1, 2, 3)^T$.
Only the OLS and lmsreg trimmed views produce estimated sufficient predictors
that are highly correlated with the sufficient predictor. (The SAVE 40% trimmed
view was also very good.)

Table 1. Estimated Sufficient Predictors Estimating $c(1, 2, 3)^T$.

method                      b1        b2        b3
OLS View                    0.0032    0.0011    0.0047
SIR                        -0.394    -0.361    -0.845
PHD                        -0.072    -0.029    -0.0097
SAVE                       -1.09      0.870    -0.480
OLS 90% Trimmed View        0.086     0.182     0.338
LMSREG 70% Trimmed View     0.143     0.287     0.428

Figure 6 helps illustrate why the best lmsreg trimmed view worked. This
view used 70% trimming, and the open circles denote cases that were trimmed
while the highlighted squares are the untrimmed cases. Note that the highlighted
cases are far more linear than the data set as a whole. Also lmsreg will give about
half of the highlighted cases zero weight, further linearizing the function. In Figure
6 the weighted lmsreg constant $\hat{\alpha}_{70}$ is included, and the plot is simply the forward
response plot of the weighted lmsreg fitted values versus y. The vertical deviations
from the identity line are the “MLR residuals” $y_i - \hat{\alpha}_{70} - \hat{\beta}_{70}^T x_i$ and at least half
of the highlighted cases have small MLR residuals. There exist data sets where
OLS is better than lmsreg for showing curvature (see Cook et al., 1992), but, as
illustrated by Example 5, lmsreg often performed better for single index models
when m was smooth.

A great deal of work remains to be done in the area of resistant dimension
reduction. When a 1D regression model holds, the trimmed views work at least as
well as the untrimmed view since the untrimmed view corresponds to 0% trimming.
The webpage (http://www.math.siu.edu/olive/) contains programs and a good
introduction to 1D regression models.
Robust Redundancy Analysis
by Alternating Regression 

M.R. Oliveira, J.A. Branco, C. Croux and P. Filzmoser 



Abstract. Given two groups of variables, redundancy analysis searches for linear
combinations of variables in one group that maximize the variance of
the other group that is explained by each one of the linear combinations.
The method is important as an alternative to canonical correlation analysis, 
and can be seen as an alternative to multivariate regression when there are 
collinearity problems in the dependent set of variables. Principal component 
analysis is itself a special case of redundancy analysis. 

In this work we propose a new robust method to estimate the redun- 
dancy analysis parameters based on alternating regressions. These estimators 
are compared with the classical estimator as well as other robust estimators 
based on robust covariance matrices. The behavior of the proposed estimators 
is also investigated under large contamination by the analysis of the empirical 
breakdown point. 

Mathematics Subject Classification (2000). 62F35; 62J99. 

Keywords. Redundancy analysis, alternating regression, robust regression. 



1. Introduction 

The problem of finding relationships between groups of variables is central in 
multivariate analysis. A number of methods have been suggested to pursue this 
objective but canonical correlation analysis is by far the most commonly used. 
Given two sets of variables the goal of canonical correlation analysis is to construct 
pairs of canonical variates (linear combinations of the original variables, one in each 
set) such that they have maximum correlation. The correlations between canonical 
variates of the same pair are important to the study of the correlations between the 
two sets but they cannot be interpreted as the degree of relation between the sets 
of variables. In particular the squared canonical correlations represent the variance 
shared by the two canonical variates of the same pair but not the variance shared 
by the two sets of observed random variables. 
To overcome the difficulty in using the squared canonical correlations as a
measure of the shared variance between the two sets, a redundancy index was
proposed by Stewart and Love (1968). This index is a measure of the proportion of the
variance in one set $y = (y_1, \ldots, y_q)^T$ (dependent or criterion set) that is accounted
for by the other set $x = (x_1, \ldots, x_p)^T$ (independent or predictor set). The
redundancy analysis model, proposed by van den Wollenberg (1977), searches for the
linear combination, $u_1 = \alpha_1^T x$ (the first redundancy variate), of the independent
set that maximizes the redundancy index, $R_y$, defined as

$$R_y = \sum_{j=1}^{q} \mathrm{Corr}(\alpha^T x, y_j)^2 / q, \qquad (1.1)$$

under the restriction $\mathrm{Var}(\alpha_1^T x) = 1$. The second redundancy variate, $u_2 = \alpha_2^T x$,
is defined as the linear combination of the independent set, non-correlated with
$u_1$, that maximizes $R_y$ under the restriction $\mathrm{Var}(\alpha_2^T x) = 1$. A maximum of
r = min(p, q) redundancy variates can be sequentially defined, following the scheme as
above. The objective of this paper is to provide robust estimators for $\alpha$ and $R_y$.
Robust estimation is particularly useful in the analysis of multivariate data where
the presence of outlying observations is common.

In Section 2, the relationship between redundancy and canonical correlation 
analysis is highlighted. This will help to address the problem of robust estimation 
in redundancy analysis and in particular to introduce the robust method based 
on alternating regressions. The algorithm to estimate the redundancy variates by 
alternating regressions is also presented in Section 2, and in Section 3 a simulation 
study is developed in order to compare the performance of the proposed estima- 
tors with others based on robust covariance matrices. In the last section a brief 
discussion about the results obtained is made and links with canonical correlation 
analysis and suggestions for future work are discussed. 

2. Robust Estimation 

The classical solution to (1.1) comes down to an eigenvector/eigenvalue problem,
as was shown by van den Wollenberg (1977). The coefficients $\alpha$ are the eigenvectors
of the matrix $R_{11}^{-1} R_{12} R_{21}$, where $R_{ij}$ (i, j = 1, 2) are the usual partition matrices of
the correlation matrix associated with the two sets of variables. A simple approach
to robust estimation is to robustify the correlation matrix and then apply the
traditional methods of estimation. So given a robust estimate of the correlation
matrix the eigenvectors of $R_{11}^{-1} R_{12} R_{21}$ are calculated in order to estimate the
coefficients $\alpha$. This approach will be considered in Section 3. For the estimation
of the robust correlation matrix we will use the M-estimator (M) as outlined
by Maronna (1976). As an alternative we will consider the minimum covariance
determinant (MCD) estimator of Rousseeuw (1985).
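
The eigenvector formulation can be sketched in NumPy as below. The function name `classical_redundancy` and the simulated data are assumptions made for illustration, and the ordinary sample correlation matrix is used; a robust version would pass a robust correlation estimate through the same eigen-decomposition. The coefficients returned apply to standardized variables.

```python
import numpy as np

def classical_redundancy(x, y):
    """Redundancy coefficients (columns) and indexes from R11^{-1} R12 R21."""
    p, q = x.shape[1], y.shape[1]
    R = np.corrcoef(np.hstack([x, y]), rowvar=False)
    R11, R12, R21 = R[:p, :p], R[:p, p:], R[p:, :p]
    eigval, eigvec = np.linalg.eig(np.linalg.solve(R11, R12 @ R21))
    eigval, eigvec = eigval.real, eigvec.real        # eigenvalues are real here
    order = np.argsort(eigval)[::-1][: min(p, q)]    # keep r = min(p, q) variates
    eigval, eigvec = eigval[order], eigvec[:, order]
    # Rescale each eigenvector so that alpha^T R11 alpha = 1, i.e. Var(alpha^T x) = 1.
    scale = np.sqrt(np.einsum("ij,ik,kj->j", eigvec, R11, eigvec))
    return eigvec / scale, eigval / q

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=(n, 3))
y = np.column_stack([x[:, 0] + 0.5 * rng.normal(size=n),
                     0.5 * x[:, 1] + rng.normal(size=n)])
alpha, ry = classical_redundancy(x, y)

# Direct evaluation of (1.1) for the first redundancy variate.
zx = (x - x.mean(axis=0)) / x.std(axis=0)
u1 = zx @ alpha[:, 0]
ry_direct = np.mean([np.corrcoef(u1, y[:, j])[0, 1] ** 2 for j in range(y.shape[1])])
```

With the unit-variance scaling, the redundancy index of each variate equals the corresponding eigenvalue divided by q, which is what the direct evaluation of (1.1) confirms for the first variate.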

The above approaches were studied for the first latent variate (redundancy
variate) in Oliveira and Branco (2003). Additionally, a method based on robust
alternating regression was considered. This technique, originally suggested by Wold
(1966), has recently been used in the context of robust factor analysis (Croux et al.,
2003) and robust canonical correlation analysis (Branco et al., 2005). As Tenenhaus
(1998) has pointed out, the algorithm is quite similar to the one proposed for
canonical correlation analysis, initially discussed by Wold (1966) and Lyttkens (1972)
and later mentioned by Tenenhaus (1998). For higher-order redundancy variates,
the algorithm is based on repeating the idea underlying the construction of
the first redundancy variate in successive residual spaces. Since these spaces have
reduced rank, problems occur in the regression procedure.

In this paper we will focus on estimating higher order redundancy variates 
by robust alternating regressions. This procedure has a main advantage over the 
approach based on robust correlation matrix estimation: While the latter method 
discards an outlying observation completely, robust alternating regression is still 
using the information of the non-outlying cells for parameter estimation. For this 
reason, the method based on robust alternating regressions can also deal with 
missing values (Croux et al., 2003). 

The general idea of the algorithm takes advantage of the links between
redundancy analysis and canonical correlation analysis. In the context of canonical
correlation analysis, the redundancy index (1.1) can be written as in Rencher
(1998) by:

$$R_y = \rho^2 \sum_{j=1}^{q} \mathrm{Corr}(\beta^T y, y_j)^2 / q, \qquad (2.1)$$

where $\rho$ is the canonical correlation coefficient and $\beta$ the canonical coefficient
associated with the y's. Looking at the redundancy coefficient from these two
different perspectives, an alternating procedure based on regression models can
be built. To clarify this idea, let us consider that an initial value $\beta$ is given. If
the redundancy index is written in the form (2.1), to maximize $R_y$ we need to
determine the vector $\alpha$ that maximizes $\rho^2 = \mathrm{Corr}(\alpha^T x, \beta^T y)^2$. From standard
results on multiple regression, it follows that $\alpha$ is proportional to the regression
coefficient a in the model

$$\beta^T y = a^T x + \gamma_1 + \varepsilon_1. \qquad (2.2)$$

This step is analogous to the procedure followed in canonical correlation analysis
(Branco et al., 2005).

Once $\alpha$ has been obtained, the value $\beta$ has to be updated. Taking into account
that $R_y$ is a measure of the proportion of the variance in set y explained by $\alpha^T x$,
Tenenhaus (1998) suggested that $\beta$ can be chosen proportional to $b = (b_1, \ldots, b_q)^T$,
where its components are the coefficients of the regression equations

$$y_j = b_j \alpha^T x + \gamma_{2j} + \varepsilon_{2j}, \qquad (2.3)$$

that maximize the variance of $y_j$ explained by $\alpha^T x$, i.e., $b_j$ is such that
$\mathrm{Corr}(y_j, b_j \alpha^T x)$ is maximum. From standard regression results
$\mathrm{Corr}(y_j, \hat{y}_j)^2 = \mathrm{Corr}(y_j, b_j \alpha^T x)^2 = \mathrm{Corr}(y_j, \alpha^T x)^2$. Choosing $b_j$ in this way leads to an updated
value of $\beta$ that maximizes the variance of $y_j$ explained by $\alpha^T x$. By (1.1)

where Ry is the value of the redundancy index obtained in the previous step. 

This scheme can be implemented using least squares regression, but it leads
to non-robust estimates. In order to robustify the parameters in redundancy
analysis we can simply use robust regression estimators. In the case of least squares
estimators convergence of the algorithm is fast. However, using a robust estimator
yields more local minima and less smooth objective functions, so the risk of
non-convergence is higher. One benefits from choosing weighted $L_1$ regression
estimators since they have bounded influence functions and the algorithm is fast
to compute, once the weights are properly calculated. These weights are defined
smoothly according to an appropriate distance measure associated with each
observation. Although weighted $L_1$ regression is a good choice, other robust regression
estimators could be used.
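
The non-robust, least squares version of the alternating scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the data are simulated, the variables are centered by means and scaled to unit standard deviation so that the result can be compared with the eigenvector solution, and the classical starting value $(1, 0, \ldots, 0)^T$ is used. A robust version would replace the two least squares steps with robust regressions.

```python
import numpy as np

def first_redundancy_variate_ls(x0, y0, max_iter=500, tol=1e-12):
    """Alternate least squares regressions (2.2) and (2.3) on centered data."""
    beta = np.zeros(y0.shape[1])
    beta[0] = 1.0                                    # classical starting value
    alpha_old = np.zeros(x0.shape[1])
    for _ in range(max_iter):
        v = y0 @ beta                                # latent variate on the y side
        a, *_ = np.linalg.lstsq(x0, v, rcond=None)   # regress v on X0
        alpha = a / np.linalg.norm(a)
        u = x0 @ alpha                               # current redundancy variate
        b = y0.T @ u / (u @ u)                       # q simple regressions of y_j on u
        beta = b / np.linalg.norm(b)
        if np.linalg.norm(alpha - alpha_old) < tol:
            break
        alpha_old = alpha
    alpha = alpha / np.linalg.norm(x0 @ alpha)       # so that u_1 has unit norm
    return alpha, x0 @ alpha

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=(n, 3))
y = np.column_stack([x[:, 0] + 0.5 * rng.normal(size=n),
                     0.5 * x[:, 1] + rng.normal(size=n)])
x0 = (x - x.mean(axis=0)) / x.std(axis=0)
y0 = (y - y.mean(axis=0)) / y.std(axis=0)
alpha1, u1 = first_redundancy_variate_ls(x0, y0)

# Compare the direction with the leading eigenvector of R11^{-1} R12 R21.
R11, R12 = x0.T @ x0 / n, x0.T @ y0 / n
eigval, eigvec = np.linalg.eig(np.linalg.solve(R11, R12 @ R12.T))
top = eigvec[:, np.argmax(eigval.real)].real
cosine = abs(alpha1 @ top) / (np.linalg.norm(alpha1) * np.linalg.norm(top))
```

In the least squares case the iteration amounts to a power method on $R_{11}^{-1} R_{12} R_{21}$, which is why the two directions agree on this data set.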

In the next subsection we give a general description of the algorithm to 
estimate the redundancy variates by alternating regressions. 

2.1. Algorithm 

It is convenient to center the observations, so that the intercept terms, $\gamma_1$ and
$\gamma_{2j}$, can be eliminated from the equations (2.2) and (2.3). The observations are
centered using robust estimators of location. In the present work, the median of
each variable, $(\tilde{x}, \tilde{y})$, has been chosen as the robust estimator of location. So, if X
and Y represent the data matrices of size (n x p) and (n x q), respectively, the
centered data are

$$X_0 = X - 1_n \tilde{x}^T \quad \text{and} \quad Y_0 = Y - 1_n \tilde{y}^T,$$

where $1_k$ is a k-vector of ones.

2.1.1. First Redundancy Variate. The algorithm starts by choosing an initial
value, followed by the estimation of the regression coefficients in (2.2).

Starting values: In the context of robust estimation, the starting values can be of
crucial importance. For the classical version of the algorithm Tenenhaus (1998)
suggested using the vector $\beta^{(0)} = (1, 0, \ldots, 0)^T$. However, we have built the starting
value using the first robust principal component of $X_0$, $z_1$ (see e.g. Croux and
Ruiz-Gazen, 1996). Let $b_j^{(0)}$ (j = 1, . . . , q) be the estimated regression coefficient
associated with the model

$$y_{0j} = b_j z_1 + \varepsilon_{3j}, \qquad (2.4)$$

where $y_{0j}$ is the jth column of $Y_0$. The starting value is defined as

$$\beta^{(0)} = b^{(0)} / \|b^{(0)}\|, \qquad (2.5)$$

where $b^{(0)} = (b_1^{(0)}, \ldots, b_q^{(0)})^T$, and the corresponding latent variate is $v^{(0)} = Y_0 \beta^{(0)}$.







Step s: Given the values $\beta^{(s-1)}$ and $v^{(s-1)}$ (for $s \ge 1$), $a^{(s)}$ is the estimated
regression coefficient of the model

$$v^{(s-1)} = X_0 a + \varepsilon_4. \qquad (2.6)$$

The estimated vector of redundancy coefficients is

$$\alpha^{(s)} = a^{(s)} / \|a^{(s)}\| \qquad (2.7)$$

and the associated redundancy variate is $u^{(s)} = X_0 \alpha^{(s)}$.

Given these new estimated values, $\beta^{(s-1)}$ and $v^{(s-1)}$ have to be updated. Let
$b_j^{(s)}$ (j = 1, . . . , q) be the estimate of the regression coefficient associated with the
model

$$y_{0j} = b_j u^{(s)} + \varepsilon_{5j}. \qquad (2.8)$$

So, the updated vector is defined as

$$\beta^{(s)} = b^{(s)} / \|b^{(s)}\|, \qquad (2.9)$$

where $b^{(s)} = (b_1^{(s)}, \ldots, b_q^{(s)})^T$, and the corresponding latent variate is $v^{(s)} = Y_0 \beta^{(s)}$.

Repeat this procedure until convergence of the algorithm, and let $\alpha_1$ and
$u_1$ be proportional to the last values of $\alpha^{(s)}$ and $u^{(s)}$, respectively, where the
proportionality constant is such that $u_1$ has norm equal to one.

The first redundancy index, $R_{y1}$, is estimated using a robust estimator of
the correlation coefficient (in the present case, the reweighted MCD estimator, RMCD,
Rousseeuw and Van Driessen, 1999):

$$\hat{R}_{y1} = \sum_{j=1}^{q} \mathrm{Corr}(u_1, y_{0j})^2 / q. \qquad (2.10)$$

We recall that RMCD is the empirical covariance matrix computed from the subset
of size $h \approx 0.75n$ with the smallest determinant, where n is the sample size.



2.1.2. Redundancy Variates of Higher Order. The next redundancy variate, $u_2$,
has to be uncorrelated with the previous one. In other words, we can say that $u_1$
and $u_2$ have to be orthogonal. This restriction can be fulfilled if the data matrix,
$X_0$, is projected into the space orthogonal to $u_1$. Let us consider the following
regression model

$$X_0 = u_1 c^T + \varepsilon_6 \qquad (2.11)$$

where $X_1 = X_0 - \hat{X}_0$ are the corresponding estimated residuals. Due to
standard results from multiple linear regression, the residuals, $X_1$, are orthogonal to
$u_1$. Hence, repeating the procedure to obtain the first redundancy variate using
the residuals, $X_1$, instead of the original data, $X_0$, we can guarantee that the
next redundancy variate (a linear combination of the columns of $X_1$) is, in fact,
orthogonal to $u_1$.

However, this idea raises some difficulties, since $X_1$ is orthogonal to $u_1$, and
therefore, it has rank (p - 1), because rank($X_0$) = p. In order to overcome this
collinearity problem we consider, as in Branco et al. (2005), the singular value
decomposition

$$X_1 = U D V^T = U^* D^* V^{*T}, \qquad (2.12)$$

where D is a diagonal matrix of the p singular values, and $D^*$ is the diagonal matrix
reduced by the row and column associated with the null singular value. A similar
idea is used to define the matrices $U^*$ and $V^*$. Using (2.12), we can define
$X_1^* = X_1 V^* = U^* D^*$, and this matrix of size (n x (p - 1)) has full rank. Finally,
the procedure proposed to estimate the first redundancy variate can be applied
to $X_1^*$ and $Y_0$. As opposed to the estimation method developed for canonical
correlation analysis (Branco et al., 2005), in the redundancy analysis there is no
need to transform $Y_0$.
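
The deflation and rank reduction steps can be sketched with least squares and the SVD. The function names and the simulated data are assumptions for illustration, and ordinary least squares is used in place of the robust regressions of the paper; the sketch only checks the linear algebra: the residual matrix is orthogonal to $u_1$, has rank p - 1, and the reduced matrix has full column rank.

```python
import numpy as np

def deflate(x0, u1):
    """Residuals of the least squares regression of each column of X0 on u1 (2.11)."""
    c = x0.T @ u1 / (u1 @ u1)
    return x0 - np.outer(u1, c)

def reduce_rank(x1, tol=1e-10):
    """Drop null singular directions of X1; return (U* D*, V*) as in (2.12)."""
    U, d, Vt = np.linalg.svd(x1, full_matrices=False)
    keep = d > tol * d.max()
    return U[:, keep] * d[keep], Vt[keep].T

rng = np.random.default_rng(7)
x0 = rng.normal(size=(50, 4))
x0 -= np.median(x0, axis=0)                   # median centering, as in Section 2.1
u1 = x0 @ np.array([1.0, -1.0, 0.5, 2.0])     # an arbitrary first variate for the demo
u1 /= np.linalg.norm(u1)

x1 = deflate(x0, u1)
x1_star, v_star = reduce_rank(x1)
```

Here `x1_star` plays the role of the full-rank matrix to which the first-variate procedure is applied again.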

Let $u^*$ and $\alpha^*$ be the results, after convergence, of the iterative procedure
developed to estimate the second redundancy variate. These quantities are defined
in residual spaces. Hence, a transformation to the original space has to be done,
under the orthogonality condition, that is, $u_2 = X_0 \alpha_2$ has to be orthogonal to
$u_1$. Let us consider the regression model

$$u^* = e u_1 + \varepsilon_7, \qquad (2.13)$$

and let u be the estimated residuals. This vector, u, is orthogonal to the previous
estimated redundancy variate, $u_1$. Moreover, in order to express the redundancy
variate as a linear combination of $X_0$ with coefficients $\alpha_2$ we need to regress
u on $X_0$

$$u = X_0 f + \varepsilon_8. \qquad (2.14)$$

Then $\alpha_2 = k f$, where k is such that $u_2 = X_0 \alpha_2$ has norm equal to one.

To obtain redundancy variates of order $l \in \{3, \ldots, r\}$, with r = min{p, q},
a similar procedure has to be developed, but to transform the estimates into the
original space, instead of using (2.13), a slightly different regression model has to
be considered

$$u_l^* = U e + \varepsilon_9, \qquad (2.15)$$

where $U = [u_1, \ldots, u_{l-1}]$. This model guarantees that $u_l$ is orthogonal to
$u_1, \ldots, u_{l-1}$.

2.2. Robust Alternating Regressions

As suggested in Croux et al. (2003) and Branco et al. (2005), the regression models
considered in the iterative procedure are estimated using weighted $L_1$ regression,
with weights $w_i(X^*_{l-1})$ defined by

$$w_i(X^*_{l-1}) = \min\left(1, \frac{\chi^2_{p^*,0.95}}{D_i(x_i)^2}\right), \qquad (2.16)$$

where $\chi^2_{p^*,0.95}$ is the upper 5% critical value of a chi-squared distribution with
$p^* = p - l + 1$ degrees of freedom (the number of columns of $X^*_{l-1}$), and
where $x_i$ is the ith row of $X^*_{l-1}$ and $D_i(x_i) = \sqrt{(x_i - T)^T C^{-1} (x_i - T)}$ is a robust distance. For the
same reasons pointed out in Croux et al. (2003) and Branco et al. (2005), T and
C were chosen to be MVE estimators of location and scatter, respectively. The
regressions used to rewrite $u^*$ in the original space were estimated using LTS
regression (Rousseeuw, 1984). However, other robust estimators of regression can
also be used.
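
A sketch of distance-based weights of the form min(1, critical value / squared distance) is given below. Several things here are assumptions: the function name is hypothetical, the location T and scatter C are coordinatewise medians and a MAD-based diagonal matrix rather than the MVE estimators used in the paper, and the chi-squared critical values are hard-coded standard table values rather than computed.

```python
import numpy as np

# Upper 5% critical values of the chi-squared distribution (standard table values).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def distance_weights(x, T, C, chi2_crit):
    """Weights min(1, chi2_crit / D_i^2) from squared distances to (T, C)."""
    diff = x - T
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)
    return np.minimum(1.0, chi2_crit / np.maximum(d2, 1e-300))

rng = np.random.default_rng(8)
x = np.vstack([rng.normal(size=(100, 3)), [[10.0, 10.0, 10.0]]])  # last row is an outlier

T = np.median(x, axis=0)                    # stand-in for a robust location estimate
mad = np.median(np.abs(x - T), axis=0)
C = np.diag((1.4826 * mad) ** 2)            # stand-in for a robust scatter estimate
w = distance_weights(x, T, C, CHI2_95[3])
```

Observations close to the bulk of the data keep weight one, while the weight of a distant observation decreases smoothly with its squared distance.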



3. Simulation Study 



It is convenient to keep the simulation conditions as in Branco et al. (2005). By 
doing so, it is possible to compare the performance of the robust estimators based 
on alternating regressions not only with other robust estimators but also with 
the robust estimators of canonical correlation analysis based on alternating re- 
gression. This makes sense since the correlation matrices chosen are such that the 
redundancy analysis and the canonical correlation analysis lead to the same lin- 
ear combination of the observed variables, i.e. the true values for the canonical 
coefficients are equal to the true values for the redundancy coefficients. 

The performance of the robust estimators based on alternating regressions 
(RAR) are compared with the performance of the estimators based on robust 
correlation matrices (RMCD - /z 0.75n and M- Estimator (M), using Huber psi- 

function: w{s) = min ogs/l'^l)) ^ ^ classical estimator 

(Class). To do so, samples of size 500 were generated from four different
distributions: normal distribution (NOR), with zero mean and covariance matrix $\Sigma$
($N(0, \Sigma)$); symmetric contaminated normal (SCN) where each vector of
observations is generated from a normal distribution ($N(0, \Sigma)$) with probability 0.9 and
from $N(0, 9\Sigma)$ with probability 0.1; t-distribution with 3 degrees of freedom (T3)
and asymmetric contamination (ACN) where 90% of the data are generated from a
$N(0, \Sigma)$ and the other 10% of the observations are equal to the vector $\mathrm{tr}(\Sigma) 1_{p+q}$.
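
The four sampling schemes can be sketched with NumPy's random generator. The function name `generate_sample` is hypothetical, and two details are assumptions not stated in the text: T3 is taken to be a multivariate t distribution built as a normal vector divided by the square root of a scaled chi-squared variable, and the ACN scheme places the contaminated points in the first 10% of the rows.

```python
import numpy as np

def generate_sample(kind, n, sigma, rng):
    """Draw n observations from one of the schemes NOR, SCN, T3 or ACN."""
    d = sigma.shape[0]
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    if kind == "NOR":
        return z
    if kind == "SCN":
        wide = rng.random(n) < 0.1           # each row is N(0, 9 Sigma) with prob. 0.1
        z[wide] *= 3.0
        return z
    if kind == "T3":
        g = rng.chisquare(3, size=n) / 3.0   # assumed multivariate t construction
        return z / np.sqrt(g)[:, None]
    if kind == "ACN":
        k = int(round(0.1 * n))
        z[:k] = np.trace(sigma) * np.ones(d) # point mass at tr(Sigma) * 1
        return z
    raise ValueError(f"unknown scheme: {kind}")

# Covariance with R11 = I, R22 = I and a diagonal R12, for p = q = 2.
r12 = np.array([[0.9, 0.0], [0.0, 0.5]])
sigma = np.block([[np.eye(2), r12], [r12.T, np.eye(2)]])
rng = np.random.default_rng(4)
samples = {kind: generate_sample(kind, 500, sigma, rng)
           for kind in ("NOR", "SCN", "T3", "ACN")}
```

Each entry of `samples` is a 500 x 4 array whose first two columns play the role of x and whose last two columns play the role of y.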

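For each type of distribution and each estimation method m = 300 samples
of 500 observations were produced. To assess the performance of each estimation
method the following measures of MSE were defined

$$\mathrm{MSE}(\hat{\alpha}_l) = \frac{1}{m} \sum_{i=1}^{m} \cos^{-1}\left( \frac{|\alpha_l^T \hat{\alpha}_l^i|}{\|\alpha_l\| \, \|\hat{\alpha}_l^i\|} \right),$$

$$\mathrm{MSE}(\hat{R}_{yl}) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{R}_{yl}^i - R_{yl} \right)^2,$$

where $\hat{\alpha}_l^i$ is the estimate of the redundancy coefficient, $\alpha_l$, and $\hat{R}_{yl}^i$ is the
estimate of the redundancy index, $R_{yl}$.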

Table 1 shows the values of p and q that were considered, together with the 
correlation matrix between the two groups. The correlation matrix of each group 
was taken as the identity matrix. 
Table 1. Simulation setup. $R_{11} = I_p$ and $R_{22} = I_q$.

n     m     p   q   R_12
500   300   2   2   [0.9 0; 0 1/2]
500   300   2   4   [0.9 0 0 0; 0 1/2 0 0]
500   300   4   4   [0.9 0 0 0; 0 1/2 0 0; 0 0 1/3 0; 0 0 0 1/4]

The results for the case p = q = 2 are summarized in Figures 1 and 2 (see
Figure 5(a) to identify each contamination scheme). In this case, it can be said
that the contamination influences the vectors of redundancy coefficients equally.
ACN is the contamination scheme that does the most damage to the estimated
values, and only the estimates of redundancy coefficients based on RMCD and
RAR are robust for this kind of contamination. ACN also has a serious effect on
the first redundancy index when the classical method is used. In the case p = 2
and q = 4 something quite similar happens, except that RAR proves to be even
better at estimating the first redundancy coefficient. Because of lack of space the
figures associated with this case are omitted.
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Figure 1. Mean square error of the redundancy coefficients, p = 2, 
q = 2. 
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Figure 2. Mean square error of the redundancy indexes, p — 2, q = 2. 

For p = q = 4 we obtain very similar results for the first redundancy coefficient and index as for p = q = 2. However, the effect of the contamination increases with the order of the redundancy coefficients (see Figures 3 and 4, and Figure 5(a) to identify each contamination scheme). The method based on alternating regression performs slightly worse than RMCD for the last two redundancy coefficients ($\alpha_3$ and $\alpha_4$) and for the last two redundancy indexes ($R_{y_3}$ and $R_{y_4}$).

Figure 3. Mean square error of the first and last redundancy coefficients, p = 4, q = 4.

Figure 4. Mean square error of the first and last redundancy indexes, p = 4, q = 4.

Figure 5. (a) On the left: legend of Figures 1, 2, 3 and 4. (b) On the right: legend of the figures summarizing the performance and the empirical breakdown point of the various estimators.

3.1. Empirical Breakdown Point

Another criterion for comparing the performance of the estimators is based on the empirical breakdown point. To this end, a simulation study was carried out with the same setup that was used in Branco et al. (2005), where each group has 3 variables ($p = q = 3$). For each sample, $(100 - \varepsilon)\%$ of the points were generated from a normal distribution with zero mean and correlation matrix $R$, with $R_{11} = I_3$, $R_{22} = I_3$ and

$$R_{12} = \begin{bmatrix} 0.9 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 1/3 \end{bmatrix},$$

and the other $\varepsilon\%$ of the observations are equal to $\mathrm{tr}(\Sigma)\mathbf{1}$ (in the present case, $\mathrm{tr}(\Sigma) = 6$). The values of $\varepsilon$ were chosen from zero (no contamination) to 25 (25% of contamination), i.e., $\varepsilon \in \{0, 1, \ldots, 25\}$. As before we chose n = 500. The procedure was repeated 200 times for each estimation method. The results, for the first and last redundancy coefficients and indexes, are summarized in Figures 6 and 7 (see Figure 5(b) to identify each estimator considered). In Figure 7(a), the MSE associated with the M, RMCD and RAR estimators seems to be equally low because of the large magnitude of the MSE of the Class estimator. Nevertheless, the empirical breakdown point for $R_{y_2}$ and $R_{y_3}$ (see Figure 7(b)) shows that RAR has a lower empirical breakdown point than RMCD but a higher one than the estimators based on the M-estimator (M). Similar conclusions apply to $\alpha_1$, $\alpha_2$ and $\alpha_3$, as shown in Figure 6.

Figure 6. Empirical breakdown point, redundancy coefficients.

Figure 7. Empirical breakdown point, redundancy indexes.

4. Discussion and Future Work

The RAR procedure produced useful results and has the advantage that it can be carried out in the case of more variables than observations. It is also capable of dealing with missing values and it can cope with outlying cells (Croux et al., 2003). Moreover, RAR aims at directly maximizing the redundancy index, and is therefore in a way intuitively more appealing than a covariance matrix based procedure. However, the search for other robust estimators should be pursued. In Oliveira and Branco (2003) various robust estimators have been studied for the first redundancy variate, where the estimators based on projection pursuit showed promising results. This estimator has to be developed for higher order variates.

The simulation developed in this study was designed to facilitate the comparison with the study in Branco et al. (2005). So far, it can be said that the behavior of RAR is similar in both canonical correlation analysis and redundancy analysis. Having resolved the problem of robustifying (by alternating regression) the canonical correlation analysis and redundancy analysis, we are compelled to consider the generalization of these two methods proposed by DeSarbo (1981).

As was pointed out by van den Wollenberg (1977), principal components analysis is a special case of redundancy analysis. If we choose the dependent variables, y, equal to the independent variables, x, the principal components differ from the redundancy variates by a constant term. In principal components analysis, we search for a linear combination of the x, $a^T x$, that maximizes $a^T R a$ subject to $a^T a = 1$ ($R = \mathrm{Corr}(x)$). In redundancy analysis we seek a linear combination of x, $\alpha^T x$, that maximizes $\alpha^T R^2 \alpha$ where $\alpha^T R \alpha = 1$. The solutions of both problems have the directions of the eigenvectors of R. Therefore, the algorithm based on alternating regressions can be used to estimate principal components. It would be interesting to compare this approach with other suggestions made in the literature, like the procedure based on alternating regressions to estimate principal components proposed in Wold (1966), and other robust estimators, e.g., based on projection pursuit.

Robust ML-estimation of the Transmitter Location

E. Ollila and V. Koivunen 



Abstract. The location of a radio transmitter may be estimated using the triangulation principle and angular measurements from at least two sensor arrays. For example, two base-stations equipped with antenna arrays observe the waveforms emitted by a transmitter such as a mobile phone. Based on the estimates of the angle-of-arrival (AoA) of the waveforms impinging on the antenna arrays at the base-stations, an estimate of the location of the transmitter can be formed using elementary geometry. In this paper, we discuss the ML-estimation of the location of the transmitter by modelling the distribution of the AoAs by two well-known angular distributions, namely the von Mises and the wrapped Cauchy distribution. These distributions are well justified by physical measurements on radio wave propagation. Since the received signals at the base-stations are corrupted by noise which may be impulsive in nature, robust estimation of the signal parameters is an important issue.

Mathematics Subject Classification (2000). Primary 62F35; Secondary 62F25. 
Keywords. Angular data, signal processing, ML-estimation, robustness. 



1. Introduction 

Location estimation using the triangulation principle has numerous applications, for example in wireless communications, electronic warfare and navigation. In such problems, the angle of arrival (AoA) of the wavefield emitted by a transmitter may be estimated using observations acquired by a sensor array. Using knowledge of the locations of the sensor arrays and the estimates of the angles of arrival of the waveforms impinging on the arrays, the location of the emitter may be estimated. See, e.g., Krim and Viberg (1996) for a tutorial on estimating AoAs using sensor arrays.

As a motivating practical example, envisage that $z_1$ and $z_2$ represent the locations of two base-stations of a cellular phone system such as GSM and the unknown $z$ is the location of the mobile phone of a subscriber. An accurate estimate of the location of the phone may be needed in case of emergency calls or just to provide subscribers with information about different services available in the neighborhood of the current location. The base-stations equipped with antenna arrays are programmed to produce estimates of the (azimuth) AoA of the signals the phone is transmitting. It is often sensible to model the AoA measurements $\Theta_n = \{\theta_1, \ldots, \theta_n\}$ and $\Omega_n = \{\omega_1, \ldots, \omega_n\}$ obtained from the base-stations during n time steps as i.i.d. random deviates symmetric about the true angles $\alpha$ and $\beta$, respectively (given that the transmitter at $z$ has been static during this measurement interval). The received signals at the base-stations are corrupted by environmental noise plus receiver noise. Man-made noise in particular is often impulsive in nature (see Middleton, 1973) and therefore robust estimation of the signal parameters (such as the AoAs of the signals or the location of the transmitter) based on the sensor array output vectors (snapshots in engineering jargon) is of major importance. Hence, robust estimation of the transmitter location is an important issue.

Figure 1. The location of the transmitter $z = (x, y)^T$ is estimated using the angles of arrival $\alpha$ and $\beta$ and the known locations of the sensor arrays $z_1 = (x_1, y_1)^T$ and $z_2 = (x_2, y_2)^T$.

If the locations of the points $z_1 = (x_1, y_1)^T$ and $z_2 = (x_2, y_2)^T$, the angle $\alpha$ between $z_1$ and a point $z$, and the angle $\beta$ between $z_2$ and the point $z$ are known, then the "unknown" $z = (x, y)^T$ can be written as the crossing point of the lines with angles $\alpha$ and $\beta$ originating from $z_1$ and $z_2$, i.e., $z = [x(\alpha, \beta), y(\alpha, \beta)]^T$. The above scenario becomes an estimation problem for $z$ if the angle $\alpha$ is unknown and measurements, say $\theta_1, \ldots, \theta_n$, of $\alpha$ are provided. Similarly, $\beta$ is unknown and measurements, say $\omega_1, \ldots, \omega_n$, are provided. Figure 1 illustrates the geometry of the measurement system.

In this paper, we model the given data sets $\Theta_n = \{\theta_1, \ldots, \theta_n\}$ and $\Omega_n = \{\omega_1, \ldots, \omega_n\}$ as i.i.d. (and also mutually independent) random deviates from angular distributions symmetric about $\alpha$ and $\beta$, respectively. We use two well-known angular distributions, namely the von Mises (VM-)distribution and the wrapped Cauchy (WrC-)distribution, as our models. The role of the former distribution for angular measurements is similar to that of the normal distribution for measurements on the line. The latter distribution is an alternative to the classical VM-distribution and its MLE of the symmetry center is a robust alternative for the "mean direction", i.e., the MLE of the VM-distribution. These model distributions are briefly reviewed in Section 3. We refer to the monographs by Mardia (1972) and Jammalamadaka and SenGupta (2001) for a detailed account of angular models. A practical, application-oriented motivation for this study and for the model distributions used is given in Section 2, where the statistics of wireless communication channels are briefly described.

Let $\hat{\alpha}$ and $\hat{\beta}$ denote the ML-estimates of the angles $\alpha$ and $\beta$ calculated from the samples $\Theta_n$ and $\Omega_n$, respectively. The MLE of the location is then $\hat{z} = [x(\hat{\alpha}, \hat{\beta}), y(\hat{\alpha}, \hat{\beta})]^T$. Throughout the paper we assume that $\beta \neq \alpha$ and $\beta \neq (\alpha + \pi) \bmod 2\pi$. This restriction is due to the estimation technique, since otherwise the lines originating from $z_1$ and $z_2$ are parallel and do not cross. Note also that the estimation technique leads to the following ambiguity: the estimated angle pairs $(\hat{\alpha}, \hat{\beta})$ and $((\hat{\alpha} + \pi) \bmod 2\pi, \hat{\beta})$ lead to the same location estimate. See Krim and Viberg (1996) for a brief description of different antenna array configurations and the ambiguities associated with them.

In this paper we investigate the statistical properties of z: large sample prop- 
erties and an estimator of standard errors are derived using ML-theory. Moreover, 
accompanying the estimate and its standard errors, also the 100p% likelihood con- 
tours are easily provided. This is discussed in Section 4. We will demonstrate 
the properties (robustness, variability) of the ML-estimators of the location based 
on VM- and WrC-distribution using Monte-Carlo studies. This is the subject of 
Section 5. Finally, Section 6 concludes. 



2. Statistics of Wireless Communication Channels 

In recent years, significant research interest in the area of wireless telecommunications has focused on modelling the statistics of indoor and outdoor mobile radio channels (Middleton, 1973; Pedersen et al., 2000; Spencer et al., 2000). The modelling then includes, for example, the pdf of the azimuth AoA and time delay of the impinging wavefields as well as their expected power. Several experimental measurements are available against which we can check the validity of the angular model distributions used in this paper. For example, Pedersen et al. (2000) have published a number of outdoor measurement results collected in a typical urban macrocellular environment. Their measurements were conducted at 1.8 GHz in Stockholm, Sweden and Aarhus, Denmark, using an eight element Uniform Linear Antenna (ULA) array at an elevated base-station (the base-station antenna is located above the surrounding buildings). The signal parameters (e.g., the AoAs, delays, etc.) of the impinging wavefields are estimated using the array snapshots. In Fig. 10 of Pedersen et al. (2000), the empirical pdf of the AoAs is depicted. The authors then suggest modelling the pdf by the Gaussian distribution. They note, however, that the Gaussian function does not provide a good match for the empirical pdf at the tails of the distribution, which are heavier than those of the Gaussian distribution. The empirical pdf indicates that the VM-distribution may provide a better fit, as its shape is close to Gaussian at the center but its tails are heavier at the ends of the angular interval. Indeed, the Gaussian function fails since it is a valid pdf for measurements on the line, and thus assigns probability mass outside the angular interval (0°, 360°].

Spencer et al. (2000) have published a number of indoor measurements for two different buildings. Their measurements were conducted at 7 GHz and scanning was done mechanically with an antenna beamwidth of 6°. In Fig. 12 of Spencer et al. (2000) the empirical pdf of the AoAs is depicted. The authors note that the Laplacian function gives a good match except at the tails of the distribution, which are heavier than those of the Laplace distribution. The empirical AoA pdf in this case indicates that the WrC-distribution may provide a better fit than the Laplace distribution.

The outdoor measurements in Pedersen et al. (2000) and the indoor measure- 
ments in Spencer et al. (2000) thus suggest a different angular model for the AoA 
distribution (i.e., the VM- and WrC-distribution, respectively). One may speculate 
whether these distributions have their origin in characterizing the true distribu- 
tion of the AoA’s for indoor and outdoor mobile radio channels respectively, or 
whether they are artifacts of different measurement systems and of the algorithms 
used to extract the signal parameters. Nonetheless, these physical real-world mea- 
surements validate that the angular distributions considered in this paper are 
reasonable models for the AoA’s of the indoor and outdoor mobile radio channels. 

3. Angular Model Distributions 

3.1. Von Mises Distribution 

The VM-distribution plays a crucial role in statistical inference for angular measurements and its role is in many ways the same as that of the normal distribution for measurements on the line. The VM-distribution is also referred to as the "Circular Normal distribution", e.g., in Jammalamadaka and SenGupta (2001), to highlight its importance and similarities with the normal distribution on the line.

An angular random variable $\theta \in (0, 2\pi]$ is said to have the von Mises distribution, denoted by VM($\alpha, \kappa$), if its pdf is given by

$$f(\theta; \alpha, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\{\kappa \cos(\theta - \alpha)\}, \tag{3.1}$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind and order zero, i.e., $I_0(\kappa) = \frac{1}{2\pi} \int_0^{2\pi} \exp\{\kappa \cos\theta\}\, d\theta$. The parameter $\alpha$ is the symmetry center and the parameter $\kappa > 0$ is the concentration parameter: the larger the $\kappa$, the greater is the concentration about the center $\alpha$. In the limit $\kappa \to \infty$, (3.1) approaches the point mass distribution at $\alpha$ and conversely, in the limit $\kappa \to 0$, (3.1) approaches the uniform distribution on the circle. Computation of the ML-estimates of the VM-distribution is discussed in Jammalamadaka and SenGupta (2001).
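
As an illustration only, the density (3.1) can be evaluated numerically as in the following minimal Python sketch. The function names `bessel_i0`, `vm_pdf` and `integrate_over_circle` are our own, and the power series used for $I_0(\kappa)$ is one standard representation chosen for convenience; it is not taken from the references above.

```python
import math


def bessel_i0(kappa):
    """Modified Bessel function of the first kind and order zero.

    Uses the power series sum_r (kappa/2)^(2r) / (r!)^2, summed until the
    terms no longer change the total at double precision.
    """
    total, term, r = 1.0, 1.0, 0
    while term > 1e-16 * total:
        r += 1
        term *= (kappa / 2.0) ** 2 / r ** 2
        total += term
    return total


def vm_pdf(theta, alpha, kappa):
    """Density (3.1) of VM(alpha, kappa) at the angle theta (radians)."""
    return math.exp(kappa * math.cos(theta - alpha)) / (2.0 * math.pi * bessel_i0(kappa))


def integrate_over_circle(pdf, steps=20000):
    """Midpoint rule over one full turn; close to 1 for a valid angular pdf."""
    h = 2.0 * math.pi / steps
    return h * sum(pdf((j + 0.5) * h) for j in range(steps))
```

Calling `integrate_over_circle(lambda t: vm_pdf(t, 0.838, 200.0))` is a quick sanity check that the normalizing constant is right; with `kappa = 0` the density reduces to the uniform value $1/(2\pi)$.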



3.2. Wrapped Cauchy Distribution 

The Cauchy distribution is a well-known alternative to the classical normal distribution and its MLE of the symmetry center serves as a robust alternative for the mean. The wrapped Cauchy distribution is an alternative to the classical von Mises distribution for modelling symmetric angular data and its MLE of the symmetry center serves as a robust alternative for the mean direction, i.e., the MLE of the center of symmetry of the VM-distribution. The WrC-distribution is simply the distribution generated by wrapping the Cauchy distribution on the line around the unit circle, or in other words, expressing the latter mod($2\pi$). An angular random variable $\theta \in (0, 2\pi]$ has the wrapped Cauchy distribution, denoted by WrC($\alpha, \rho$), if its pdf is of the form

$$f(\theta; \alpha, \rho) = \frac{1}{2\pi} \, \frac{1 - \rho^2}{1 + \rho^2 - 2\rho \cos(\theta - \alpha)}, \tag{3.2}$$

where the parameter $\alpha$ represents the symmetry center and $\rho$ represents a shape parameter with $0 \le \rho < 1$. When $\rho \to 1$, (3.2) approaches the point mass distribution at $\alpha$ and when $\rho = 0$, (3.2) is simply the uniform distribution on the circle. A novel iterative algorithm to calculate the ML-estimates of the WrC-distribution was proposed by Kent and Tyler (1988).
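
The wrapping construction can be made concrete with a short Python sketch. It evaluates the density (3.2) and draws WrC deviates by wrapping Cauchy deviates around the circle. The names `wrc_pdf` and `wrc_sample` are hypothetical, and the sketch relies on the standard relation $\rho = e^{-\gamma}$ between the WrC shape parameter and the scale $\gamma$ of the Cauchy distribution being wrapped, which is an assumption brought in here rather than something stated in the text.

```python
import math
import random


def wrc_pdf(theta, alpha, rho):
    """Density (3.2) of WrC(alpha, rho) at the angle theta (radians)."""
    return (1.0 - rho ** 2) / (
        2.0 * math.pi * (1.0 + rho ** 2 - 2.0 * rho * math.cos(theta - alpha))
    )


def wrc_sample(alpha, rho, n, rng):
    """Draw n deviates by wrapping Cauchy(alpha, gamma) deviates mod 2*pi.

    Requires 0 < rho < 1; gamma = -log(rho) is the Cauchy scale parameter.
    Values are returned in [0, 2*pi).
    """
    gamma = -math.log(rho)
    sample = []
    for _ in range(n):
        u = rng.random()
        # Inverse-cdf draw from the Cauchy distribution on the line.
        x = alpha + gamma * math.tan(math.pi * (u - 0.5))
        sample.append(x % (2.0 * math.pi))
    return sample
```

For example, `wrc_sample(0.838, 0.8, 15, random.Random(1))` mimics one of the sample settings used later in Section 5; the seed is arbitrary.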



4. ML-estimation of the Transmitter Location 



As noted earlier, the location $z = (x, y)^T$ of the transmitter can be written as the crossing point of the lines with angles $\alpha$ and $\beta$ originating from the locations of the base-stations $z_1 = (x_1, y_1)^T$ and $z_2 = (x_2, y_2)^T$; mathematically,

$$x = x(\alpha, \beta) = \frac{y_2 - y_1 + \tan(\alpha) x_1 - \tan(\beta) x_2}{\tan(\alpha) - \tan(\beta)}, \tag{4.1}$$

$$y = y(\alpha, \beta) = \frac{\tan(\alpha) y_2 - \tan(\beta) y_1 - \tan(\alpha) \tan(\beta)(x_2 - x_1)}{\tan(\alpha) - \tan(\beta)}, \tag{4.2}$$

where, by definition,

$$\alpha = \alpha(x, y) = \tan^{-1}\{(y - y_1)/(x - x_1)\}, \tag{4.3}$$

$$\beta = \beta(x, y) = \tan^{-1}\{(y - y_2)/(x - x_2)\}. \tag{4.4}$$
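
The triangulation formulas (4.1) to (4.4) translate directly into code. The following Python sketch is illustrative only: the function names `locate` and `angles` are ours, and `angles` uses `math.atan2` reduced mod $2\pi$ instead of a plain inverse tangent, which is an assumption made here so that the quadrant of each angle is resolved.

```python
import math


def locate(alpha, beta, z1, z2):
    """Crossing point (4.1)-(4.2) of the lines from z1 and z2 with angles alpha, beta."""
    (x1, y1), (x2, y2) = z1, z2
    ta, tb = math.tan(alpha), math.tan(beta)
    x = (y2 - y1 + ta * x1 - tb * x2) / (ta - tb)
    y = (ta * y2 - tb * y1 - ta * tb * (x2 - x1)) / (ta - tb)
    return x, y


def angles(z, z1, z2):
    """Angles (4.3)-(4.4) of the point z seen from z1 and z2, in [0, 2*pi)."""
    two_pi = 2.0 * math.pi
    alpha = math.atan2(z[1] - z1[1], z[0] - z1[0]) % two_pi
    beta = math.atan2(z[1] - z2[1], z[0] - z2[0]) % two_pi
    return alpha, beta


if __name__ == "__main__":
    z1, z2 = (20.0, 100.0), (100.0, 400.0)
    alpha, beta = angles((200.0, 300.0), z1, z2)
    print(round(alpha, 3), round(beta, 3))
    print(locate(alpha, beta, z1, z2))
```

With the geometry of Section 5.1 the two functions are inverse to each other: the angles computed from $z = (200, 300)^T$ are mapped back to that point, up to rounding.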



Recall that we model the given data sets $\Theta_n = \{\theta_1, \ldots, \theta_n\}$ and $\Omega_n = \{\omega_1, \ldots, \omega_n\}$ as i.i.d. (and also mutually independent) random deviates from angular distributions symmetric about $\alpha$ and $\beta$, respectively. Let $\hat{\alpha}$ and $\hat{\beta}$ denote the MLEs of the angles $\alpha$ and $\beta$ calculated from the samples $\Theta_n$ and $\Omega_n$, respectively. Then the MLE of the location $z$ of the transmitter is simply

$$\hat{z} = \begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} = \begin{bmatrix} x(\hat{\alpha}, \hat{\beta}) \\ y(\hat{\alpha}, \hat{\beta}) \end{bmatrix}.$$

We denote the symmetric angular distribution of the $\theta_i$'s by $F_1$, which is parameterized by $\alpha$, the symmetry center of the distribution, and $\tau_1$ (say), the shape parameter of the distribution. Similarly, denote the angular distribution of the $\omega_i$'s by $F_2$, which is parameterized by $\beta$, the symmetry center of the distribution, and $\tau_2$ (say), the shape parameter of the distribution. $F_1$ and $F_2$ need not be the same angular distribution.

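Next consider the asymptotic distribution of the MLE $\hat{z} = (\hat{x}, \hat{y})^T$ in the case that the shape parameters $\tau_1$ and $\tau_2$ of the angular distributions are known. Then, under sufficient regularity conditions,

$$\sqrt{n} \begin{bmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{bmatrix} \to_D N_2\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} I_1(\alpha)^{-1} & 0 \\ 0 & I_2(\beta)^{-1} \end{bmatrix} \right),$$

where $I_1(\alpha)$ and $I_2(\beta)$ denote the expected information functions for a single observation from $F_1$ and $F_2$. If the above holds, then the MLE $\hat{z}$ has a limiting bivariate normal distribution with the variance-covariance matrix $I(x, y)^{-1}$, where $I(x, y) = Q^T I(\alpha, \beta) Q$ is the expected information function for $(x, y)$, $I(\alpha, \beta) = \mathrm{diag}\{I_1(\alpha), I_2(\beta)\}$ and

$$Q = \begin{bmatrix} \partial\alpha(x, y)/\partial x & \partial\alpha(x, y)/\partial y \\ \partial\beta(x, y)/\partial x & \partial\beta(x, y)/\partial y \end{bmatrix} = \begin{bmatrix} -\sin(\alpha)/r_1 & \cos(\alpha)/r_1 \\ -\sin(\beta)/r_2 & \cos(\beta)/r_2 \end{bmatrix}, \tag{4.5}$$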

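where $r_1, r_2$ denote the Euclidean distances between $z$ and $z_1, z_2$, respectively. Now, denote

$$I(x, y)^{-1} = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}. \tag{4.6}$$

Then we obtain the following expressions for its components:

$$\sigma_x^2 = \frac{1}{(\tan(\alpha) - \tan(\beta))^2} \left( \frac{r_1^2 I_1(\alpha)^{-1}}{\cos^2(\alpha)} + \frac{r_2^2 I_2(\beta)^{-1}}{\cos^2(\beta)} \right),$$

$$\sigma_y^2 = \frac{\tan^2(\alpha) \tan^2(\beta)}{(\tan(\alpha) - \tan(\beta))^2} \left( \frac{r_1^2 I_1(\alpha)^{-1}}{\sin^2(\alpha)} + \frac{r_2^2 I_2(\beta)^{-1}}{\sin^2(\beta)} \right),$$

$$\sigma_{xy} = \frac{2 \tan(\alpha) \tan(\beta)}{(\tan(\alpha) - \tan(\beta))^2} \left( \frac{r_1^2 I_1(\alpha)^{-1}}{\sin(2\alpha)} + \frac{r_2^2 I_2(\beta)^{-1}}{\sin(2\beta)} \right).$$

Note that for the existence of the covariance matrix, we also need to assume that $\alpha, \beta \notin \{0, \pi/2, \pi, 3\pi/2, 2\pi\}$.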
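
As a numerical cross-check of the component formulas, the matrix $I(x, y)^{-1}$ can also be obtained by forming $Q^T \mathrm{diag}\{I_1(\alpha), I_2(\beta)\} Q$ from (4.5) and inverting it. The Python sketch below does both and is meant as an illustration under stated assumptions: the function names are ours, and the information values passed in (`info_a`, `info_b`) are arbitrary illustrative numbers, not values from the paper.

```python
import math


def q_matrix(alpha, beta, r1, r2):
    """Jacobian Q of (alpha, beta) with respect to (x, y), as in (4.5)."""
    return [[-math.sin(alpha) / r1, math.cos(alpha) / r1],
            [-math.sin(beta) / r2, math.cos(beta) / r2]]


def covariance_via_q(alpha, beta, r1, r2, info_a, info_b):
    """Invert I(x, y) = Q^T diag(info_a, info_b) Q with the 2x2 inverse formula."""
    q = q_matrix(alpha, beta, r1, r2)
    d = (info_a, info_b)
    i_xy = [[sum(q[k][i] * d[k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    det = i_xy[0][0] * i_xy[1][1] - i_xy[0][1] * i_xy[1][0]
    return [[i_xy[1][1] / det, -i_xy[0][1] / det],
            [-i_xy[1][0] / det, i_xy[0][0] / det]]


def covariance_closed_form(alpha, beta, r1, r2, info_a, info_b):
    """Components of (4.6) written out in terms of tangents, sines and cosines."""
    ta, tb = math.tan(alpha), math.tan(beta)
    c = 1.0 / (ta - tb) ** 2
    va, vb = r1 ** 2 / info_a, r2 ** 2 / info_b
    var_x = c * (va / math.cos(alpha) ** 2 + vb / math.cos(beta) ** 2)
    var_y = c * ta ** 2 * tb ** 2 * (va / math.sin(alpha) ** 2 + vb / math.sin(beta) ** 2)
    cov_xy = c * 2.0 * ta * tb * (va / math.sin(2.0 * alpha) + vb / math.sin(2.0 * beta))
    return [[var_x, cov_xy], [cov_xy, var_y]]
```

Both routes should agree to rounding error whenever $\alpha$ and $\beta$ avoid the excluded values, which makes the pair a convenient unit test for either implementation.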

From the expression of $I(x, y)^{-1}$ we may conclude the following: (A) the accuracy of the location estimate decreases when the distance $r_1$ (and/or $r_2$) between the transmitter $z$ and the base-station $z_1$ (and/or $z_2$) increases; (B) the better the accuracy of the angle estimates $\hat{\alpha}$ and $\hat{\beta}$ (i.e., the smaller their asymptotic variances $I_1(\alpha)^{-1}$ and $I_2(\beta)^{-1}$ are), the better is the accuracy of the location estimate.

As an estimator of the covariance matrix of $\hat{z} = (\hat{x}, \hat{y})^T$ we may use the inverse of the observed information function, where

$$J(x, y) = \hat{Q}^T \mathrm{diag}\{J_1(\hat{\alpha}), J_2(\hat{\beta})\} \hat{Q}. \tag{4.7}$$

Here we evaluate $Q$ at the MLE to obtain $\hat{Q}$, and $J_1(\hat{\alpha})$ is the observed information function for the sample $\Theta_n$ from $F_1$ evaluated at the MLE $\hat{\alpha}$. Similarly, $J_2(\hat{\beta})$ is the observed information function for the sample $\Omega_n$ from $F_2$ evaluated at the MLE $\hat{\beta}$.

Next consider the case that the shape parameters $\tau_1$ and $\tau_2$ are unknown. Under sufficient regularity conditions, $\sqrt{n}(\hat{\alpha} - \alpha, \hat{\tau}_1 - \tau_1)^T$ and $\sqrt{n}(\hat{\beta} - \beta, \hat{\tau}_2 - \tau_2)^T$ have a limiting bivariate normal distribution. If $I_{12}(\alpha, \tau_1) = E[-\partial^2 \log f_1 / \partial\alpha\,\partial\tau_1] = 0$ and $I_{12}(\beta, \tau_2) = E[-\partial^2 \log f_2 / \partial\beta\,\partial\tau_2] = 0$, then the ML-estimates $\hat{\alpha}$ and $\hat{\tau}_1$ and the ML-estimates $\hat{\beta}$ and $\hat{\tau}_2$ are asymptotically independent, and consequently the asymptotic behavior of $\hat{z}$ is the same whether $\tau_1$ and $\tau_2$ are known parameters or are estimated from the data. This holds for the model distributions considered in this paper. Thereby, if $\tau_1$ and $\tau_2$ are unknown, the covariance matrix of $\hat{z}$ is estimated by $J(x, y)^{-1}$ given in (4.7), with the exception that $\tau_1$ and $\tau_2$ in the resulting formulae are replaced by their ML-estimates.

Finally, we note that accompanying the MLE $\hat{z}$ and its standard error estimates, the related 100p% likelihood contours (i.e., the approximate 100(1 - p)% confidence regions; see Kalbfleisch (1985, Sec. 11.5)) can also easily be calculated. In the case that $\tau_1$ and $\tau_2$ are known parameters, the log relative likelihood function $R(x, y)$ may be calculated, owing to the invariance of the likelihood function under reparameterization, as follows: $R(x_0, y_0)$ for any pair of values $(x_0, y_0)$ is found by computing the corresponding values $\alpha_0 = \alpha(x_0, y_0)$ and $\beta_0 = \beta(x_0, y_0)$ and then equating $R(x_0, y_0)$ with $R_1(\alpha_0) + R_2(\beta_0)$, where $R_1(\alpha)$ and $R_2(\beta)$ are the log relative likelihood functions for a sample $\Theta_n$ from $F_1$ and a sample $\Omega_n$ from $F_2$. Then the desired 100p% likelihood contours may be plotted on the x-y plane. Often, a computationally more feasible solution is to calculate the normal approximation of the likelihood contours. In the case that $\tau_1$ and $\tau_2$ are unknown, the related 100p% maximum likelihood contours (Kalbfleisch, 1985), i.e., the profile likelihoods using the terminology of Lindsey (1996), can be calculated in a similar fashion.
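
The recipe for the log relative likelihood can be sketched in Python for the special case where both samples are modelled as von Mises with known concentration parameters. This is a hedged illustration, not the authors' implementation: the function names are ours, the VM log-likelihood of the center is taken to be $\kappa \sum_i \cos(\theta_i - \alpha)$ up to a constant, the mean direction is used as its maximizer, and `math.atan2` is assumed in place of the inverse tangent to resolve the quadrant of $\alpha_0$ and $\beta_0$.

```python
import math


def mean_direction(sample):
    """ML-estimate of the VM symmetry center: direction of the resultant vector."""
    s = sum(math.sin(t) for t in sample)
    c = sum(math.cos(t) for t in sample)
    return math.atan2(s, c) % (2.0 * math.pi)


def log_rel_lik_vm(angle, sample, kappa):
    """Log relative likelihood of a VM center for known kappa (zero at the MLE)."""
    a_hat = mean_direction(sample)
    return kappa * sum(math.cos(t - angle) - math.cos(t - a_hat) for t in sample)


def log_rel_lik_xy(x0, y0, theta, omega, kappa1, kappa2, z1, z2):
    """R(x0, y0) = R1(alpha0) + R2(beta0), with the angles computed from (x0, y0)."""
    alpha0 = math.atan2(y0 - z1[1], x0 - z1[0])
    beta0 = math.atan2(y0 - z2[1], x0 - z2[0])
    return log_rel_lik_vm(alpha0, theta, kappa1) + log_rel_lik_vm(beta0, omega, kappa2)


def inside_contour(x0, y0, p, theta, omega, kappa1, kappa2, z1, z2):
    """True if (x0, y0) lies inside the 100p% likelihood contour."""
    r = log_rel_lik_xy(x0, y0, theta, omega, kappa1, kappa2, z1, z2)
    return r >= math.log(p)
```

Evaluating `inside_contour` on a grid of $(x_0, y_0)$ values and tracing the boundary of the accepted points gives the contour for a chosen p, for instance p = 0.01, 0.1 or 0.5.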



5. Simulation Studies 

5.1. Simulation Arrangements 

We now explain the simulation setup with an example. The transmitter is located at $z = (200, 300)^T$ and the base-stations are at $z_1 = (20, 100)^T$ and $z_2 = (100, 400)^T$. This yields $\alpha = 0.838$ and $\beta = 5.498$ for the angles (measured in radians). We generated samples $\theta_1, \ldots, \theta_n$ and $\omega_1, \ldots, \omega_n$ from the VM-distributions VM($\alpha, \kappa_1$) and VM($\beta, \kappa_2$), respectively, with shape parameters $\kappa_1 = 200$ and $\kappa_2 = 50$. The sample size is n = 15. From the simulated samples we calculated the MLEs of the angles $\alpha$ and $\beta$ and of the location $z$. In Figure 2, the configuration of the transmitter and the base-stations is shown together with the MLE of the transmitter location. The estimates for the covariance matrix of $\hat{z}$, calculated by $J(x, y)^{-1}$, were

$$\begin{bmatrix} 26.195 & 1.971 \\ 1.971 & 27.216 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 27.492 & 0.711 \\ 0.711 & 28.410 \end{bmatrix},$$

where the last matrix corresponds to the case that $\kappa_1$ and $\kappa_2$ were estimated from the data.

Figure 2. The location of the transmitter is indicated by a filled circle and its MLE by an asterisk. The locations of the base-stations are indicated by filled squares.

Figure 3a depicts the 1%, 10% and 50% likelihood contours and the normal approximations ($\kappa_1$ and $\kappa_2$ are taken to be known). Note that the normal approximation provides a good approximation despite the small sample size (n = 15). In the case that $\kappa_1$ and $\kappa_2$ are not known, the relevant summary about $(x, y)$ is obtained by the 100p% maximum likelihood (profile likelihood) contours, which are depicted in Figure 3b.




Figure 3. Figure a) depicts the 1%, 10% and 50% likelihood contours and their normal approximations (dotted lines). Figure b) depicts the 1%, 10% and 50% maximum likelihood (profile likelihood) contours. The location of the transmitter is indicated by the star and its MLE by the triangle.

5.2. Study 1: Estimation of the Accuracy of the Location Estimator

As described in Section 4, the accuracy (the variance-covariance matrix) of the VM-estimate or WrC-estimate of the location can be conveniently estimated by $J(x, y)^{-1}$. In the present simulation study we investigate the performance of this estimator of accuracy for small sample sizes and also compare its performance to non-parametric estimators of the variance-covariance matrix based on the principles of the Jackknife and the Bootstrap. We refer to Efron and Tibshirani (1993), Shao and Wu (1989) and Shao (1989) for a comprehensive treatment of the Bootstrap and the Jackknife.

Let $\hat{\alpha}$ and $\hat{\beta}$ be the angle estimates based on the samples $\theta_1, \ldots, \theta_n$ and $\omega_1, \ldots, \omega_n$ from symmetric angular distributions $F_1$ and $F_2$, and let $\hat{z} = (x(\hat{\alpha}, \hat{\beta}), y(\hat{\alpha}, \hat{\beta}))^T$ be the corresponding location estimate. Then draw, say, M bootstrap samples from each of the data sets. From the ith pair of bootstrap samples calculate the estimates $\hat{\alpha}_i^*$ and $\hat{\beta}_i^*$ and the location estimate $\hat{z}_i^* = (x(\hat{\alpha}_i^*, \hat{\beta}_i^*), y(\hat{\alpha}_i^*, \hat{\beta}_i^*))^T$, $i = 1, \ldots, M$. The estimator of the variance-covariance matrix of $\hat{z} = (\hat{x}, \hat{y})^T$ based on the bootstrap is then defined as

$$BS(x, y) = \frac{1}{M} \sum_{i=1}^{M} (\hat{z}_i^* - \bar{z}^*)(\hat{z}_i^* - \bar{z}^*)^T,$$

where $\bar{z}^* = M^{-1} \sum_{i=1}^{M} \hat{z}_i^*$. Let $\hat{\alpha}_{(i)}$ and $\hat{\beta}_{(i)}$ be the estimates of $\alpha$ and $\beta$ after deleting the observations $\theta_i$ and $\omega_i$ from the samples, and let $\hat{z}_{(i)} = (x(\hat{\alpha}_{(i)}, \hat{\beta}_{(i)}), y(\hat{\alpha}_{(i)}, \hat{\beta}_{(i)}))^T$ be the corresponding location estimate. The estimator of the variance-covariance matrix based on the Jackknife is then

$$JK(x, y) = \frac{n - 1}{n} \sum_{i=1}^{n} (\hat{z}_{(i)} - \hat{z}_{(\cdot)})(\hat{z}_{(i)} - \hat{z}_{(\cdot)})^T,$$

where $\hat{z}_{(\cdot)} = n^{-1} \sum_{i=1}^{n} \hat{z}_{(i)}$.
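
The two resampling estimators can be sketched in a few lines of Python. The sketch makes several assumptions that should be read as ours: the angle estimate is the VM-estimate (the mean direction), the bootstrap scatter is normalized by 1/M and the jackknife scatter by (n - 1)/n, both samples have the same size n, and all function names are hypothetical.

```python
import math
import random


def mean_direction(sample):
    """VM-estimate of the symmetry center (direction of the resultant vector)."""
    s = sum(math.sin(t) for t in sample)
    c = sum(math.cos(t) for t in sample)
    return math.atan2(s, c) % (2.0 * math.pi)


def locate(alpha, beta, z1, z2):
    """Crossing point of the lines from z1 and z2 with angles alpha and beta."""
    ta, tb = math.tan(alpha), math.tan(beta)
    x = (z2[1] - z1[1] + ta * z1[0] - tb * z2[0]) / (ta - tb)
    y = (ta * z2[1] - tb * z1[1] - ta * tb * (z2[0] - z1[0])) / (ta - tb)
    return x, y


def scatter(points, scale):
    """scale * sum_i (z_i - mean)(z_i - mean)^T for a list of 2-d points."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return [[scale * sxx, scale * sxy], [scale * sxy, scale * syy]]


def bootstrap_cov(theta, omega, z1, z2, m, rng):
    """BS(x, y): scatter of location estimates over m bootstrap sample pairs."""
    zs = []
    for _ in range(m):
        t_star = [rng.choice(theta) for _ in theta]  # resample with replacement
        o_star = [rng.choice(omega) for _ in omega]
        zs.append(locate(mean_direction(t_star), mean_direction(o_star), z1, z2))
    return scatter(zs, 1.0 / m)


def jackknife_cov(theta, omega, z1, z2):
    """JK(x, y): scatter of leave-one-out location estimates, scaled by (n-1)/n."""
    n = len(theta)
    zs = [locate(mean_direction(theta[:i] + theta[i + 1:]),
                 mean_direction(omega[:i] + omega[i + 1:]), z1, z2)
          for i in range(n)]
    return scatter(zs, (n - 1.0) / n)
```

A sample pair for trying this out can be generated with `random.Random(seed).vonmisesvariate(mu, kappa)` from the Python standard library; the square roots of the diagonal entries are then the standard error estimates of the kind tabulated in Table 1.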

The means of the standard error estimates of the VM-estimates and WrC-estimates of the location $z = (x, y)^T$ for 2000 simulated samples of size n = 15 and n = 100 from the VM- and WrC-distributions are tabulated in Table 1. Standard deviations are reported in parentheses. The Monte Carlo standard error estimates are also reported, i.e., the sample standard deviations of $\hat{x}$ and $\hat{y}$ over the 2000 simulated data sets. The parameters of the sample distributions were $\alpha = 0.838$ and $\beta = 5.498$, $\kappa_1 = \kappa_2 = 200$ and $\rho_1 = \rho_2 = 0.8$. We assume that the shape parameters of the distributions are unknown and thus they are also estimated from the data sets. A reasonable choice of estimator to use in practice is then the one that is consistently close to the Monte Carlo estimate. We see that the $J(x, y)^{-1}$ estimator has the best performance (i.e., smaller bias, smaller variability) when the samples are from the true model distribution, but is clearly biased when this is not the case. The BS-estimator also performs well except in the case that the samples are from the WrC-distribution and the sample size is small (n = 15). This is due to the poor robustness properties of the Bootstrap (Stromberg, 1997). The JK-estimator, however, does not seem to have this problem. The simulation thus suggests using the JK-estimator instead of $J(x, y)^{-1}$ if one suspects violations of the model assumptions. Since the BS-estimator is clearly biased for samples from heavy tailed distributions (such as WrC) with small sample sizes, we favor the JK-estimator over the BS-estimator.

Table 1. Monte Carlo standard errors of VM- and WrC-estimates of (x, y) together with means of the various standard error estimators for the 2000 simulated samples of sample size n = 15 and n = 100. Standard deviations are reported in parentheses.

  Estimate  Sample  Sample  Monte    Estimator of standard error
            size    distr.  Carlo    J(x,y)^{-1}      BS(x,y)         JK(x,y)
  VM        15      VM      3.806    3.732 (0.60)     3.727 (0.64)    3.985 (0.77)
                            3.962    3.822 (0.60)     3.812 (0.64)    4.057 (0.79)
                    WrC     32.48    38.14 (14.10)    84.51 (847)     32.90 (12.65)
                            31.75    37.25 (11.50)    83.79 (777)     32.24 (10.05)
            100     VM      1.493    1.495 (0.09)     1.493 (0.12)    1.508 (0.11)
                            1.534    1.529 (0.09)     1.524 (0.12)    1.543 (0.11)
                    WrC     11.55    14.08 (1.67)     11.47 (1.50)    11.39 (1.42)
                            11.78    14.34 (1.38)     11.63 (1.27)    11.64 (1.21)
  WrC       15      VM      4.838    3.628 (1.09)     5.340 (1.56)    5.329 (2.54)
                            5.133    3.720 (1.08)     5.469 (1.55)    5.442 (2.56)
                    WrC     20.43    18.80 (7.42)     144.8 (2168)    21.48 (10.67)
                            20.33    18.86 (6.74)     137.8 (2102)    21.58 (9.76)
            100     VM      1.914    1.491 (0.16)     1.962 (0.31)    1.961 (0.33)
                            1.963    1.525 (0.16)     2.003 (0.31)    2.013 (0.33)
                    WrC     6.796    6.790 (0.89)     7.025 (1.13)    6.906 (1.12)
                            6.957    6.948 (0.84)     7.180 (1.08)    7.079 (1.10)

5.3. Study 2: Robustness of the Location Estimator 

The geometry of the transmitter and the base-stations is as in the earlier example of Section 5.1. We generated a pair of samples $\theta_1, \ldots, \theta_n$ and $\omega_1, \ldots, \omega_n$ from the $\varepsilon$-contaminated VM-distributions CVM($\varepsilon, \alpha, \kappa_1$) and CVM($\varepsilon, \beta, \kappa_2$), respectively, using $\kappa_1 = 200$ and $\kappa_2 = 50$. The CVM($\varepsilon, \alpha, \kappa$) is defined so that $100(1 - \varepsilon)\%$ of the observations are good observations from VM($\alpha, \kappa$) and $100\varepsilon\%$ are bad observations from VM($2\pi - \alpha, \kappa$). Note that this kind of situation may arise in the real world, for example when the base-stations receive the desired signal $100(1 - \varepsilon)\%$ of the time but "listen" to a different transmitter (interferer) $100\varepsilon\%$ of the time. The sample size was n = 50 and the number of sample pairs was m = 500. We chose $\varepsilon = 0.2$. From the ith simulated sample pair we calculated the WrC- and VM-estimates $(\hat{\alpha}_i, \hat{\beta}_i)$ of the angles $(\alpha, \beta)$ and the corresponding estimates $\hat{z}_i$ of the location $z$. This procedure was repeated for all $i = 1, \ldots, m$ pairs of samples.
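
One replication of this contamination experiment, for a single angle, can be sketched in Python as follows. The sketch is only indicative. The contaminated fraction is taken as exactly `round(eps * n)` observations, and the WrC-estimate is found by a crude grid search over the center and shape parameter of the log-likelihood of (3.2); this grid search is a stand-in chosen for brevity and is not the iterative algorithm of Kent and Tyler (1988). All function names and grid sizes are our own.

```python
import math
import random

TWO_PI = 2.0 * math.pi


def cvm_sample(eps, alpha, kappa, n, rng):
    """n deviates from CVM(eps, alpha, kappa) with exactly round(eps*n) bad ones."""
    n_bad = round(eps * n)
    good = [rng.vonmisesvariate(alpha, kappa) for _ in range(n - n_bad)]
    bad = [rng.vonmisesvariate(TWO_PI - alpha, kappa) for _ in range(n_bad)]
    return good + bad


def vm_center(sample):
    """VM-estimate of the center: the mean direction."""
    s = sum(math.sin(t) for t in sample)
    c = sum(math.cos(t) for t in sample)
    return math.atan2(s, c) % TWO_PI


def wrc_center(sample, n_alpha=628):
    """WrC-estimate of the center by grid search over (alpha, rho)."""
    rhos = [0.05 * k for k in range(1, 20)] + [0.97, 0.99]
    best_ll, best_alpha = -math.inf, 0.0
    for j in range(n_alpha):
        a = TWO_PI * j / n_alpha
        cosines = [math.cos(t - a) for t in sample]
        for rho in rhos:
            # Log-likelihood of (3.2) without the constant -n*log(2*pi).
            ll = sum(math.log((1.0 - rho * rho) / (1.0 + rho * rho - 2.0 * rho * c))
                     for c in cosines)
            if ll > best_ll:
                best_ll, best_alpha = ll, a
    return best_alpha


def circular_distance(a, b):
    """Absolute angular difference in [0, pi]."""
    return abs(math.atan2(math.sin(a - b), math.cos(a - b)))
```

For a sample from `cvm_sample(0.2, 0.838, 200.0, 50, random.Random(0))`, comparing `circular_distance(vm_center(sample), 0.838)` with `circular_distance(wrc_center(sample), 0.838)` shows the kind of contrast that Figure 4 displays over all 500 replications.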

Figure 4. Plots of the WrC- (left) and VM-estimates (right) of the angle $\beta$ calculated from 500 simulated samples from the CVM-distribution. The true angle value is also plotted (the line extending over the unit circle).

Figure 5. Plots of the WrC- (left) and VM-estimates (right) of the location $z$ calculated from the 500 simulated samples from the CVM-distribution. The locations of the base-stations $z_1$ and $z_2$ are indicated by filled squares. The crossing point of the dashed lines is at the true location of the transmitter $z$.



We visualized the bias and the variability of the different estimators by plotting the calculated angle estimates $\hat{\alpha}_i$ and $\hat{\beta}_i$, $i = 1, \ldots, m$, as radii on the unit circle. Figure 4 demonstrates that the VM-estimates of the angle are greatly affected and drawn towards the contaminated observations. The WrC-estimates are very robust: their variability is small and they are distributed around the true angle. Plots of the corresponding location estimates are shown in Figure 5. This confirms the findings for the angle estimates: the WrC-estimates of the transmitter location are robust whereas the VM-estimates are highly sensitive to outlying observations.

6. Final Remarks

In this paper we considered the ML-estimation of the location of a transmitter based on the angle-of-arrival data provided by the base-stations and modelled by angular distributions. We used the von Mises and wrapped Cauchy distributions, which can be motivated by real-world measurements as discussed in Section 2. The transmitter location was then estimated using the MLEs of the angles and elementary geometry. We demonstrated that accompanying the MLE, its standard errors and likelihood regions (approximate confidence regions) can also easily be calculated. Our simulation demonstrated that the WrC-estimate of the location is robust.

References 

[1] B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, 
New York, 1993. 

[2] S.R. Jammalamadaka and A. SenGupta, Topics in Circular Statistics. World Scien- 
tific Press, Singapore, 2001. 

[3] J.G. Kalbfleisch, Probability and Statistical Inference Volume 2: Statistical Inference.
Springer-Verlag, New York, 1985.

[4] J.T. Kent and D.E. Tyler, Maximum Likelihood Estimation for the Wrapped Cauchy 
Distribution. J. Appl. Stat. 15 (1988), 247-254. 

[5] H. Krim and M. Viberg, Two Decades of Array Signal Processing: The Parametric 
Approach. IEEE Signal Proc. Mag. 13 (1996), 67-94. 

[6] J.K. Lindsey, Parametric Statistical Inference. Clarendon Press, Oxford, 1996. 

[7] K.V. Mardia, Statistics of Directional Data. Academic Press, London, 1972. 

[8] D. Middleton, Man-made Noise in Urban Environments and Transportation Systems: 
Models and Measurements. IEEE Trans. Commun. 21 (1973), 1232-1241. 

[9] K.I. Pedersen, P.E. Mogensen, and B.H. Fleury, A Stochastic Model of the
Temporal and Azimuthal Dispersion Seen at the Base Station in Outdoor Propagation
Environments. IEEE Trans. Veh. Tech. 49 (2000), 437-447.

[10] J. Shao, The Efficiency and Consistency of Approximations to the Jackknife Variance 
Estimators. J. Amer. Statist. Assoc. 84 (1989), 114-119. 

[11] J. Shao and C.F.J. Wu, A General Theory For Jackknife Variance Estimation. Ann. 
Statist. 17 (1989), 1176-1197. 

[12] Q.H. Spencer, B.D. Jeffs, M.A. Jensen, and A.L. Swindlehurst, Modeling the Sta- 
tistical Time and Angle of Arrival Characteristics of an Indoor Multipath Channel. 
IEEE J. Sel. Areas Comm. 18 (2000), 347-360. 

[13] A.J. Stromberg, Robust Covariance Estimates Based on Resampling. J. Statist. 
Plann. Inference 57 (1997), 321-334. 

E. Ollila and V. Koivunen 

SMARAD CoE, Helsinki University of Technology, P.O. Box 3000 
Fin-02015 HUT, Finland 

e-mail: {esa.ollila, visa.koivunen}@hut.fi




Statistics for Industry and Technology, 259-269 
© 2004 Birkhäuser Verlag Basel/Switzerland



A Family of Scale Estimators 
by Means of Trimming 

J.F. Ortega 

Abstract. A very common practice in the literature is to build scale
estimators by means of a location estimator. The most common scale estimator
is the standard deviation, which is obtained by using the mean; it is not
the most suitable choice because of its weakness in the presence of outliers.
There are other scale estimators, based on location estimators, which are
more resistant in the presence of outliers, for instance the so-called
Mad, which is built on the median.

Since the mean and the median can be considered extreme elements of a
family of location estimators known as trimmed means, in this paper we propose
a family of scale estimators called αβ_Trimmed. For its definition, we use the
above-mentioned trimmed means family and two parameters α and β,
which are called trimming levels. We demonstrate the good robustness
behavior of the elements of such a family. It is proved that they are
affine equivariant and (depending on the trimming levels) have a high exact
fit point, a high breakdown point and a bounded sensitivity curve.

Mathematics Subject Classification (2000). Primary 62G35; Secondary 62G05. 
Keywords. Robust scale estimators, robustness, efficiency. 



1. Introduction 

In certain cases, scale estimators are defined by using a location estimator as
the basis for their construction. Generally speaking, the behavior of the scale
estimators in the presence of outliers is inherited from the behavior of the
location estimators used.

The most commonly used scale estimator is the square root of the variance,
which is built using the mean. Regarding the mean, we know that just one
observation of the sample is able to influence its calculation, in such a way
that it can “lead” it to any desired value. Therefore, other scale estimators
have been defined by using other location estimators, which are more resistant
in the presence of outliers (for instance, Rousseeuw and Croux, 1993).

The median (Med) is an estimator with good robustness properties (it is
equivariant, with a maximum breakdown point and bounded influence function,
that is, B-Robust), so that it has been used in different papers for the
construction of scale estimators. The best known is the median absolute
deviation (Mad) given by Hampel et al. (1986), whose definition, considering a
sample $x = \{x_1, x_2, \ldots, x_n\}$, is:

\[ Mad(x) = C(Mad)\, Med(\{|x_i - Med(x)|\}_i) \]

where C(Mad) is a consistency coefficient under the assumption of normality
which has value 1.4826.
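
The definition above translates directly into a few lines of Python. This is a minimal sketch using only the standard library; the function name `mad` and the two small samples are ours and serve only to show that a single gross outlier leaves the estimate essentially unchanged.

```python
from statistics import median

def mad(x, c=1.4826):
    """Mad(x) = C(Mad) * Med({|x_i - Med(x)|}_i), with C(Mad) = 1.4826."""
    center = median(x)
    return c * median([abs(xi - center) for xi in x])

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
dirty = [9.8, 10.1, 10.0, 9.9, 1000.0]   # one value replaced by an outlier

print(mad(clean), mad(dirty))            # both values are close to 0.148
```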

In the trade-off between robustness and efficiency, the behavior of the
Mad in contaminated samples is very good, with the best possible asymptotic
breakdown point (1/2) and bounded influence function. Its weakest point is its
efficiency, which reaches only 37% under the assumption of normality. On the
other hand, there are scale estimators with high breakdown point and high
efficiency. For instance, Yohai and Zamar (1988) obtained a class of regression
estimates based on scale estimators which simultaneously have high breakdown
point and high efficiency under normality, and Croux (1994) provided
M-estimators of scale which combine high breakdown point and high efficiency.

The mean and the median can be regarded as extreme elements of a family
known as trimmed means. The elements of this family are defined through a
parameter α, called the trimming level, where $\alpha \in [0, 50)$: the trimmed
mean at trimming level α (α_Trimmed) is the mean of the observations of the
sample after eliminating the α% highest and the α% lowest observations.
That is, for a sample $x = \{x_1, x_2, \ldots, x_n\}$, where $x_{[i]}$ denotes
the observation in the $i$th position when the elements of $x$ are sorted from
the smallest to the largest, and considering $a = \mathrm{Int}(\alpha n/100)$,
the α_Trimmed of the sample $x$, denoted $\alpha\_\mathrm{Trim}(x)$, is defined
in the following way:

\[ \alpha\_\mathrm{Trim}(x) = \frac{1}{n - 2a} \sum_{i=a+1}^{n-a} x_{[i]} \tag{1.1} \]

In this way, the 0_Trimmed is the mean, and if α tends towards 50 then the
α_Trimmed tends to the median. So, if we define the 50_Trimmed to be the median,
we can define a trimmed mean for every α in the interval [0, 50].
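
As a small illustration of the trimmed mean at level α, the following Python sketch sorts the sample, drops the a = Int(αn/100) smallest and largest observations and averages the rest, treating level 50 as the median as in the text. The function name `trimmed_mean` and the sample values are assumptions made for this example only.

```python
from statistics import median

def trimmed_mean(x, alpha):
    """Trimmed mean at level alpha, a percentage in [0, 50]."""
    if not 0 <= alpha <= 50:
        raise ValueError("alpha must lie in [0, 50]")
    if alpha == 50:
        # The text defines the 50_Trimmed to be the median.
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(alpha * n / 100)        # a = Int(alpha * n / 100)
    kept = xs[a:n - a]              # x_[a+1], ..., x_[n-a]
    return sum(kept) / (n - 2 * a)

sample = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0, 23.0, 1000.0]
print(trimmed_mean(sample, 0))      # plain mean, dragged up by the outlier
print(trimmed_mean(sample, 10))     # smallest and largest value removed
print(trimmed_mean(sample, 50))     # median
```

With one observation trimmed from each end, the outlier 1000 no longer enters the average, which is the behavior the breakdown point (a + 1)/n quantifies.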

The properties of these estimators were studied in different papers (for
instance, Hoaglin et al., 1983), proving that, for a variable which follows a
symmetric distribution of parameter μ, the α_Trimmed form a class of unbiased
estimators of μ which are affine equivariant, have breakdown point equal to
(a + 1)/n and asymptotic breakdown point α/100 (both depending on the trimming
level), and are B-Robust iff a ≥ 1. On the other hand, their efficiency
(considering the relative efficiency with respect to the mean) under normality
can be approximated by (100 − 0.723α)%, so the efficiency of the elements in
this family varies between the 64% of the median and the maximum 100% of the
mean.

As a consequence, the mean (0_Trimmed) has the highest possible efficiency
for normal distributions (100%), together with the worst behavior in the
presence of outliers (breakdown point of value 0 and unbounded influence
function, that is, it is not a B-Robust estimator). On the other hand, the
median (50_Trimmed) has a lower efficiency under normality (64%), but its
asymptotic breakdown point is the maximum possible, its influence function is
bounded, and it is clearly more suitable for the treatment of outliers.

The possibility of choosing a compromise between the generic properties of
robustness and efficiency, together with the structure of the two scale
estimators defined by means of the extreme elements of the trimmed means
family, leads us to propose some scale estimators built from the elements of
such a family.

Our paper is organized as follows: in Section 2, we propose the new scale esti- 
mators family, by using trimmed means. In Section 3 we study the main properties 
of the elements of this family in the presence of outliers. In Section 4 the behavior 
of this family under normality is analyzed. Finally, in Section 5, we present some 
conclusions. 

2. A New Scale Estimators Family 

Starting from the variance and replacing the means by elements of the trimmed 
means family, it is possible to build a spread estimators family which, as we will 
see later, has suitable behavior in the presence of outliers. 

Since the variance is defined as the mean of the squares of the distances
from the observations to the mean, we propose to substitute the two means
used by two trimmed means. Therefore, in order to define the new spread
estimators we will need two parameters: one of them is used to calculate a
location estimator, which plays the role of the mean; the other one provides a
mean of the squared distances of the observations to this location estimator.
In this way, we define the following concept:

Definition 2.1. Given a sample $x = \{x_1, x_2, \ldots, x_n\}$, the parameters
$\alpha, \beta \in [0, 50]$ and the trimmed means family being known, we define
the element αβ_Trimmed over the sample $x$, termed
$\alpha\beta\_\mathrm{Trim}(x)$, as:

\[ \alpha\beta\_\mathrm{Trim}(x) = C(\alpha,\beta)\; \beta\_\mathrm{Trim}(\{(x_i - \alpha\_\mathrm{Trim}(x))^2\}_i) \]

where $C(\alpha,\beta)$ is a consistency coefficient.

As in the definition of the Mad, where the consistency coefficient is
supposed to be invariant under changes of location and/or scale of the sample
$x$, the coefficients $C(\alpha,\beta)$ will also be independent of these
changes.
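
Definition 2.1 can be sketched in Python as two nested trimmed means. The helper names `trimmed_mean` and `alpha_beta_trim` are ours, and the consistency coefficient is passed in as a plain argument `c` (defaulting to 1) because its value depends on the assumed distribution; this is an illustrative sketch, not a reference implementation.

```python
from statistics import median

def trimmed_mean(x, level):
    """Trimmed mean at the given level in [0, 50]; level 50 is the median."""
    if level == 50:
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(level * n / 100)
    return sum(xs[a:n - a]) / (n - 2 * a)

def alpha_beta_trim(x, alpha, beta, c=1.0):
    """Element of Definition 2.1; c stands in for C(alpha, beta)."""
    center = trimmed_mean(x, alpha)                  # location step
    squared = [(xi - center) ** 2 for xi in x]       # squared deviations
    return c * trimmed_mean(squared, beta)           # spread step

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(alpha_beta_trim(x, 0, 0))     # mean of squared deviations from the mean
print(alpha_beta_trim(x, 50, 50))   # median of squared deviations to the median
```

Taking the square root of the result gives a quantity on the scale of the data.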

On the other hand, by varying α and β in the interval [0, 50] a family of
estimators is obtained, so that each of its elements provides a trimmed mean (at
level β) of the squared distances between the elements of x and a
“representative” of them, calculated by means of another trimmed mean (now at
level α). Thus, the extremes of such a family are the variance of x, in the case
of α = β = 0, and an estimator that we will call median of deviations to the
median (MDM), in the case that α = β = 50, defined in the following way:

\[ MDM(x) = C(MDM)\, Med(\{(x_i - Med(x))^2\}_i) \]

The square root of the MDM is very similar to the Mad, with slight
differences between them because of the parity of the sample size. However, the
MDM respects, in a stricter way, the structure of the variance, since it
considers the squares of the distances and not their absolute values, which, in
practice, can be useful when handling such a concept.

The square root of the elements of the family given in Definition 2.1, that is
$\sqrt{\alpha\beta\_\mathrm{Trimmed}}$, based on samples of random variables,
can be regarded as scale estimators, since they provide some information about
the “deviation” or “variability” of the elements of the considered sample.



3. Properties of the New Estimators

The properties that we are going to analyze in this section are the usual ones in 
order to prove the robustness of a scale estimator (see Rousseeuw and Leroy, 1987; 
or Barnett and Lewis, 1994). 

Efficiency and Consistency

In the definition of the elements of the new estimators family, consistency
coefficients appear, which depend on the distribution of the sample variable
over which the estimator applies, and also on the parameters α and β. These
coefficients need to be carefully studied, and this will be done in Section 4
in the case of samples from normal populations.

Equivariance 

The equivariance property deals with the behavior of the estimators when there 
are changes in position and/or scale in the sample data. For the proposed scale 
estimators the following proposition is true: 

Proposition 3.1. For a sample $x = \{x_1, x_2, \ldots, x_n\}$ and
$r, s \in \mathbb{R}$, with $rx + s = \{r x_1 + s, r x_2 + s, \ldots, r x_n + s\}$,
it is true for every $\alpha, \beta \in [0, 50]$ that:

\[ \alpha\beta\_\mathrm{Trim}(rx + s) = r^2\, \alpha\beta\_\mathrm{Trim}(x) \]

Proof. By the construction of the elements of the αβ_Trimmed family
(Definition 2.1), based on the definition of the α_Trimmed, and using the
invariance of the consistency coefficients under changes in location and in
scale, the result is immediate. □

Therefore, changes of position do not affect the elements of the family,
whereas under changes of scale they behave in the way that is usual for
scale estimators. To sum up, the elements of the
$\sqrt{\alpha\beta\_\mathrm{Trimmed}}$ family are affine equivariant scale
estimators.
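
The equivariance property can be checked numerically. The sketch below (helper names, sample values and the constants r and s are ours, with the consistency coefficient set to 1) compares the estimator on the transformed sample rx + s with r² times the estimator on x, since the elements of the family are built from squared deviations.

```python
from math import isclose
from statistics import median

def trimmed_mean(x, level):
    """Trimmed mean at the given level in [0, 50]; level 50 is the median."""
    if level == 50:
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(level * n / 100)
    return sum(xs[a:n - a]) / (n - 2 * a)

def alpha_beta_trim(x, alpha, beta, c=1.0):
    """Element of Definition 2.1; c stands in for C(alpha, beta)."""
    center = trimmed_mean(x, alpha)
    return c * trimmed_mean([(xi - center) ** 2 for xi in x], beta)

x = [1.0, 4.0, 2.5, 9.0, 7.0, 3.0, 8.0, 6.0, 5.0, 20.0]
r, s = -3.0, 7.0                                   # arbitrary scale and shift

lhs = alpha_beta_trim([r * xi + s for xi in x], 20, 20)
rhs = r ** 2 * alpha_beta_trim(x, 20, 20)
print(lhs, rhs, isclose(lhs, rhs))
```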



Exact Fit 

For the study of the exact fit property, the concept of exact fit point is used
as an indicator of the degree of fulfillment. The exact fit point $\delta^*_n$
of a scale estimator for a sample $x = \{x_1, x_2, \ldots, x_n\}$ of size n,
where $x_i = x_0\ \forall i$, measures the minimal fraction of contamination in
x that is necessary before the estimator yields any value other than the
desirable one, which is 0.

Let us consider the following proposition: 



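Proposition 3.2. Consider a sample $x = \{x_1, x_2, \ldots, x_n\}$; then the
exact fit point of αβ_Trimmed for x is:

\[ \delta^*_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} \]

where $a = \mathrm{Int}(\alpha n/100)$ and $b = \mathrm{Int}(\beta n/100)$ for
every $\alpha, \beta \in [0, 50)$.

In the particular case of MDM, we have that
$\delta^*_n(MDM, x) = \mathrm{Int}\left(\frac{n+1}{2}\right)/n$.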

Proof. Let us distinguish between the MDM (case a) and the rest of the family
elements (case b). Later, in (c), we will show their connection.

Let us assume, in all the cases, that $x = \{x_1, x_2, \ldots, x_n\}$ is a
sample where $x_i = x_0\ \forall i$, and that
$xC_m = \{x'_i\}_{i=1,2,\ldots,n}$ is a contaminated sample with m observations
of x replaced by arbitrary points. Denote by J the set of indices corresponding
to contaminated data points in $xC_m$.

(a): In the case of MDM, the exact fit point is defined as:

\[ \delta^*_n(MDM, x) = \mathrm{Min}\left\{ \frac{m}{n} : \exists\, xC_m \text{ such that } MDM(xC_m) \neq 0 \right\} \]

For the median, we know that:

• If $m < \mathrm{Int}\left(\frac{n+1}{2}\right)$ then $Med(xC_m) = x_0$, and
  therefore:

  \[ (x'_i - Med(xC_m))^2 = \begin{cases} (x'_i - x_0)^2 \neq 0 & \text{if } i \in J \\ 0 & \text{otherwise} \end{cases} \]

  so that $MDM(xC_m) = 0$.

• If $m \geq \mathrm{Int}\left(\frac{n+1}{2}\right)$ then we can find a
  contaminated sample $xC_m$ such that $Med(xC_m) \neq x_0$, and therefore
  $MDM(xC_m) \neq 0$.

Thus, we find:

\[ \delta^*_n(MDM, x) = \mathrm{Int}\left(\frac{n+1}{2}\right)/n \]

(b): In the case of the αβ_Trimmed, the exact fit point is defined as:

\[ \delta^*_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{m}{n} : \exists\, xC_m \text{ such that } \alpha\beta\_\mathrm{Trim}(xC_m) \neq 0 \right\} \]

Note that the exact fit point for the α_Trimmed is (a + 1)/n, which yields:

• If $m > a$ then we can find an $xC_m$ such that
  $\alpha\_\mathrm{Trim}(xC_m) \neq x_0$. So,
  $(x'_i - \alpha\_\mathrm{Trim}(xC_m))^2$ is higher than 0 for every i, and
  hence $\alpha\beta\_\mathrm{Trim}(xC_m)$ is different from 0 for every β.

• If $m \leq a$ then the number of elements with
  $(x'_i - \alpha\_\mathrm{Trim}(xC_m))^2$ different from 0 is exactly m.
  Therefore, if $m > b$ then $\alpha\beta\_\mathrm{Trim}(xC_m) \neq 0$.

To sum up, $\alpha\beta\_\mathrm{Trim}(xC_m) \neq 0$ whenever $m > a$ or
$m > b$, and therefore:

\[ \delta^*_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} \]

(c): It is easy to prove that, if α and β tend towards 50, then
$\delta^*_n(\alpha\beta\_\mathrm{Trimmed}, x)$ tends toward
$\delta^*_n(MDM, x)$, since in this case a = b and then:

• if n is even, $n = 2k$ with $k \in \mathbb{N}$, then
  $a = \mathrm{Int}(\alpha n/100) = k - 1$ and
  $\mathrm{Int}\left(\frac{n+1}{2}\right) = k$; whereas,

• if n is odd, $n = 2k + 1$ with $k \in \mathbb{N}$, then
  $a = \mathrm{Int}(\alpha n/100) = k$ and
  $\mathrm{Int}\left(\frac{n+1}{2}\right) = k + 1$.

Therefore, for every n we have
$\mathrm{Int}\left(\frac{n+1}{2}\right) = a + 1 = b + 1$.

□

Thus, the degree of fulfillment of the exact fit property depends on the
considered trimming levels, this degree being the highest one in the case of
MDM, extreme element of the αβ_Trimmed family.
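
The exact fit behavior can be illustrated with a constant sample in which m observations are replaced by a gross outlier. The sketch below uses our own helper names and one particular contamination pattern (n = 10, α = β = 20, so a = b = 2), with the consistency coefficient set to 1; it shows the estimate staying at 0 while m ≤ 2 and leaving 0 once m = 3, which agrees with an exact fit point of (a + 1)/n = 3/10.

```python
from statistics import median

def trimmed_mean(x, level):
    """Trimmed mean at the given level in [0, 50]; level 50 is the median."""
    if level == 50:
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(level * n / 100)
    return sum(xs[a:n - a]) / (n - 2 * a)

def alpha_beta_trim(x, alpha, beta, c=1.0):
    """Element of Definition 2.1; c stands in for C(alpha, beta)."""
    center = trimmed_mean(x, alpha)
    return c * trimmed_mean([(xi - center) ** 2 for xi in x], beta)

def contaminate(n, m, x0=5.0, bad=1000.0):
    """Constant sample of size n with m values replaced by `bad`."""
    return [x0] * (n - m) + [bad] * m

n, alpha, beta = 10, 20, 20          # a = b = Int(20 * 10 / 100) = 2
results = {m: alpha_beta_trim(contaminate(n, m), alpha, beta) for m in range(5)}
for m, value in results.items():
    print(m, value)
```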

Sensitivity Curve 

The sensitivity curve investigates the changes of the estimator that occur when 
we introduce a particular contamination. The sensitivity curve must be bounded, 
which means that the estimator produces controlled values for every contamina- 
tion. 

Let us look at the behavior of the sensitivity curve for the new estimators, 
by means of the following proposition: 

Proposition 3.3. Given the trimming levels $\alpha, \beta \in [0, 50]$, where
$a = \mathrm{Int}(\alpha n/100)$ and $b = \mathrm{Int}(\beta n/100)$, the
sensitivity curve of the αβ_Trimmed estimator is bounded iff
$\mathrm{Min}\{a, b\} \geq 1$.

Proof. For a sample of n elements $x = \{x_1, x_2, \ldots, x_n\}$ and for any
value $x_0 \in \mathbb{R}$, the sensitivity curve for every αβ_Trimmed is:

\[ SC_n(x_0; \alpha\beta\_\mathrm{Trimmed}, x) = (n + 1)\,[\alpha\beta\_\mathrm{Trim}(x') - \alpha\beta\_\mathrm{Trim}(x)] \]

where $x' = x \cup \{x_0\}$ and

\[ \alpha\beta\_\mathrm{Trim}(x') = C(\alpha,\beta)\; \beta\_\mathrm{Trim}(\{(x_i - \alpha\_\mathrm{Trim}(x'))^2\}_i, (x_0 - \alpha\_\mathrm{Trim}(x'))^2) \]

Under these conditions, since x' is a contamination of x in one observation,
if $a \geq 1$ then $\alpha\_\mathrm{Trim}(x')$ is between the minimum and the
maximum of x. Moreover, it will be necessary that $b \geq 1$ in order to ensure
that $(x_0 - \alpha\_\mathrm{Trim}(x'))^2$ is eliminated, if required, so that
$\alpha\beta\_\mathrm{Trim}(x')$ is “controlled”.

On the other hand, if a = 0 or b = 0 then $\alpha\beta\_\mathrm{Trim}(x')$ is
“non-controlled”, since the value $(x_0 - \alpha\_\mathrm{Trim}(x'))^2$ can be
any value.

In short, if $\mathrm{Min}\{a, b\} \geq 1$ we can state that
$SC_n(x_0; \alpha\beta\_\mathrm{Trimmed}, x)$ is bounded. □

In this way, the sensitivity curve of the αβ_Trimmed estimator is bounded if
and only if a and b are greater than or equal to 1, that is, if some
observation of the sample is eliminated. Remember that if α = β = 0, then we are
considering the variance, which has an unbounded sensitivity curve. However, if
α = β = 50 then a and b will be greater than 1, so 50,50_Trimmed = MDM will also
have a bounded sensitivity curve.
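
The contrast between the bounded and the unbounded case can be seen by evaluating the sensitivity curve at increasingly extreme added points. In this sketch the helper names and the sample (the integers 1 to 20) are ours and the consistency coefficient is set to 1: with α = β = 0 the curve grows without limit, while with α = β = 10 (so a = b = 2) it stops changing once the added point is large enough to be trimmed.

```python
from statistics import median

def trimmed_mean(x, level):
    """Trimmed mean at the given level in [0, 50]; level 50 is the median."""
    if level == 50:
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(level * n / 100)
    return sum(xs[a:n - a]) / (n - 2 * a)

def alpha_beta_trim(x, alpha, beta, c=1.0):
    """Element of Definition 2.1; c stands in for C(alpha, beta)."""
    center = trimmed_mean(x, alpha)
    return c * trimmed_mean([(xi - center) ** 2 for xi in x], beta)

def sensitivity_curve(x0, x, alpha, beta):
    """SC_n(x0) = (n + 1) * [T(x with x0 added) - T(x)]."""
    n = len(x)
    return (n + 1) * (alpha_beta_trim(x + [x0], alpha, beta)
                      - alpha_beta_trim(x, alpha, beta))

x = [float(i) for i in range(1, 21)]               # n = 20
for x0 in (1e3, 1e6):
    print(x0,
          sensitivity_curve(x0, x, 0, 0),          # variance-like element
          sensitivity_curve(x0, x, 10, 10))        # trimmed element
```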

Breakdown Point 

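The breakdown point $\varepsilon^*_n(S, x)$ of a scale estimator S for a sample
x of size n is defined in Rousseeuw and Croux (1993) by:

\[ \varepsilon^*_n(S, x) = \mathrm{Min}\{\varepsilon^+_n(S, x), \varepsilon^-_n(S, x)\} \]

where

\[ \varepsilon^+_n(S, x) = \mathrm{Min}\left\{ \frac{m}{n} : \mathrm{Sup}_{xC_m}\{S(xC_m)\} = \infty \right\} \]

is called explosion breakdown point, and

\[ \varepsilon^-_n(S, x) = \mathrm{Min}\left\{ \frac{m}{n} : \mathrm{Inf}_{xC_m}\{S(xC_m)\} = 0 \right\} \]

is called implosion breakdown point, and where $xC_m$ is a contamination of m
elements of x.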

Let us calculate the breakdown point of the new estimators. 

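Proposition 3.4. For a sample $x = \{x_1, x_2, \ldots, x_n\}$ we have that the
breakdown point of the αβ_Trimmed over x is:

\[ \varepsilon^*_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} \]

for every $\alpha, \beta \in [0, 50]$, with $a = \mathrm{Int}(\alpha n/100)$
and $b = \mathrm{Int}(\beta n/100)$.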

Proof. Assume that $xC_m$ is a contamination of m elements of x. In the case of
$\varepsilon^+_n(\alpha\beta\_\mathrm{Trimmed}, x)$, if $m \geq (a + 1)$ then we
can find a contaminated sample $xC_m$ such that $\alpha\_\mathrm{Trim}(xC_m)$ is
not “controlled” if only the information of the sample x is known; so
$\alpha\beta\_\mathrm{Trim}(xC_m)$ is not “controlled” either. On the other
hand, if $m < (a + 1)$, so that $\alpha\_\mathrm{Trim}(xC_m)$ is between the
highest element and the lowest one of the sample x, it is necessary that
$m \geq (b + 1)$ in order to ensure that we can find a contaminated sample
$xC_m$ such that
$\alpha\beta\_\mathrm{Trim}(xC_m) = C(\alpha,\beta)\, \beta\_\mathrm{Trim}(\{(x'_i - \alpha\_\mathrm{Trim}(xC_m))^2\}_i)$,
where $x'_i \in xC_m$, is not “controlled” if only the information of the
sample x is known.

So, it is necessary that m is lower than (a + 1) and also lower than (b + 1) in
order to ensure that, for every contamination, the estimator αβ_Trimmed is
“controlled” if only the information of the sample x is known, that is,
$\mathrm{Sup}_{xC_m}\{\alpha\beta\_\mathrm{Trim}(xC_m)\}$ is finite. Therefore:

\[ \varepsilon^+_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} \]

Let us concentrate on the case of
$\varepsilon^-_n(\alpha\beta\_\mathrm{Trimmed}, x)$. It is easy to see that at
least n − b observations of the set
$\{(x'_i - \alpha\_\mathrm{Trim}(xC_m))^2\}_i$ being 0 are needed in order to
ensure $\alpha\beta\_\mathrm{Trim}(xC_m) = 0$. Thus, we need a contaminated
sample $xC_m$ with $m \geq n - b$. Therefore:

\[ \varepsilon^-_n(\alpha\beta\_\mathrm{Trimmed}, x) = \frac{n-b}{n} \]

It follows that
$\varepsilon^+_n(\alpha\beta\_\mathrm{Trimmed}, x) \leq \varepsilon^-_n(\alpha\beta\_\mathrm{Trimmed}, x)$
because:

• if n is even, $n = 2k$ with $k \in \mathbb{N}$, then $a + 1 \leq k$ and
  $b + 1 \leq k$. So, since
  $k < k + 1 = 2k - k + 1 = n - (k - 1) \leq n - b$, then $a + 1 < n - b$ and
  $b + 1 < n - b$;

• if n is odd, $n = 2k + 1$ with $k \in \mathbb{N}$, then $a \leq k$ and
  $b \leq k$. So, since $k + 1 = 2k + 1 - k = n - k \leq n - b$, then
  $a + 1 \leq n - b$ and $b + 1 \leq n - b$.

To sum up, the breakdown point of the αβ_Trimmed is:

\[ \varepsilon^*_n(\alpha\beta\_\mathrm{Trimmed}, x) = \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} \]

□



Regarding this last proposition, we can claim that the asymptotic breakdown
point of αβ_Trimmed is:

\[ \varepsilon^*(\alpha\beta\_\mathrm{Trimmed}) = \mathrm{Min}\left\{ \frac{\alpha}{100}, \frac{\beta}{100} \right\} \]

and as a consequence, for the extreme elements of the αβ_Trimmed family it
follows that $\varepsilon^*(\text{variance}) = 0$ and
$\varepsilon^*(MDM) = 1/2$.
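
As a worked illustration of the finite-sample and asymptotic formulas (the values n = 20, α = 10 and β = 25 are ours, chosen only for the arithmetic), we have $a = \mathrm{Int}(10 \cdot 20/100) = 2$ and $b = \mathrm{Int}(25 \cdot 20/100) = 5$, so that

\[ \mathrm{Min}\left\{ \frac{a+1}{n}, \frac{b+1}{n} \right\} = \mathrm{Min}\left\{ \frac{3}{20}, \frac{6}{20} \right\} = 0.15, \qquad \mathrm{Min}\left\{ \frac{\alpha}{100}, \frac{\beta}{100} \right\} = \mathrm{Min}\{0.10, 0.25\} = 0.10 . \]

In both expressions the smaller trimming level determines the breakdown point, and the finite-sample value (a + 1)/n approaches α/100 as n grows.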



4. The New Estimators Under Normality 

In order to have the elements of the αβ_Trimmed family defined, it is necessary
to start by determining the consistency coefficient which is used in their
definition. We will then determine the efficiency of the elements of the
αβ_Trimmed family.

The analysis will be carried out under the normality assumption, since it is 
the most common one. 

Consistency Coefficients 

Assuming that x is an i.i.d. sample of the variable $X \sim N(\mu, \sigma)$,
Definition 2.1 provides an estimator family where the coefficient
$C(\alpha,\beta)$ is defined so that the corresponding element of such a
family, for the trimming levels α and β, is consistent (in mean square) for
$\sigma^2$.

Table 1. Consistency coefficients under normality.

     β      C(α, β)
     0     1.0000000
     5     1.2485537
    10     1.4280902
    15     1.5884787
    20     1.7344188
    25     1.8653508
    30     1.9788504
    35     2.0717771
    40     2.1409476
    45     2.1836602
    50     2.1981094

Under this assumption, and considering that every element of the trimmed
means family is affine equivariant and that the α_Trimmed is an unbiased
estimator of μ, it is true that:

\[ \alpha\beta\_\mathrm{Trim}(X) = C(\alpha,\beta)\; \beta\_\mathrm{Trim}((X - \alpha\_\mathrm{Trim}(X))^2) = C(\alpha,\beta)\; \sigma^2\; \beta\_\mathrm{Trim}(X') \]

where $X' = ((X - \mu)/\sigma)^2$. Assuming that the values $C(\alpha,\beta)$
must ensure the consistency of $\alpha\beta\_\mathrm{Trim}(X)$, it must be true
that $\alpha\beta\_\mathrm{Trim}(X) = \sigma^2$. Applying this condition we
obtain:

\[ C(\alpha,\beta) = [\beta\_\mathrm{Trim}(X')]^{-1} \]

where X' follows a $\chi^2_1$ distribution.

If the distribution function of a $\chi^2_1$ is denoted F then, for
$\beta \in [0, 50)$ and $a_1, a_2 \in \mathbb{R}$ where $F(a_1) = \beta/100$
and $F(a_2) = 1 - (\beta/100)$, we find that

\[ C(\alpha,\beta) = \left[ \frac{1}{1 - 2\beta/100} \int_{a_1}^{a_2} x \, dF(x) \right]^{-1} \]

The consistency coefficients for some values of β in the interval [0, 50] are
given in Table 1, where for β = 50 we find that
$\beta\_\mathrm{Trim}(X') = F^{-1}(1/2)$, so that the consistency coefficient
for MDM coincides with the square of the one calculated for the Mad estimator.
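
The coefficients of Table 1 can be recomputed with the Python standard library alone. The sketch below is our own and rests on two standard facts about the chi-square law with one degree of freedom: its quantiles are squared normal quantiles, and the integral of x dF(x) over [0, t] equals the chi-square cdf with three degrees of freedom at t. The function names are hypothetical.

```python
from math import erf, exp, pi, sqrt
from statistics import NormalDist

def chi2_1_quantile(p):
    """Quantile of the chi-square(1) law: the square of a normal quantile."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2

def partial_mean(t):
    """Integral of x dF(x) over [0, t], F being the chi-square(1) cdf.

    Since x * f_1(x) = f_3(x), this is the chi-square(3) cdf evaluated at t.
    """
    return erf(sqrt(t / 2)) - sqrt(2 * t / pi) * exp(-t / 2)

def consistency_coefficient(beta):
    """Inverse of the beta-trimmed mean of a chi-square(1) variable."""
    if beta == 0:
        return 1.0                              # the mean of chi-square(1) is 1
    if beta == 50:
        return 1.0 / chi2_1_quantile(0.5)       # inverse of the median
    p = beta / 100
    a1, a2 = chi2_1_quantile(p), chi2_1_quantile(1 - p)
    trimmed = (partial_mean(a2) - partial_mean(a1)) / (1 - 2 * p)
    return 1.0 / trimmed

for beta in range(0, 55, 5):
    print(beta, round(consistency_coefficient(beta), 7))
```

For β = 50 the result is the square of 1.4826, the Mad coefficient quoted in the Introduction.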



Efficiency 

Once the values of the consistency coefficients are calculated, it is possible
to determine the efficiency of the elements of the αβ_Trimmed family, where the
considered efficiency is the relative efficiency of the elements of the
αβ_Trimmed family with regard to the corrected variance.

The efficiency of the elements of the αβ_Trimmed family will be estimated by
means of a simulation study. For that reason, 10000 samples of size 1000 have
been generated from N(0, 1). For each sample the elements of the αβ_Trimmed
family have been calculated for values of α and β within the interval [0, 50]
in steps of 5. Afterwards we have calculated the variance of the resulting
estimates and compared it to the variance of the classical variance estimator,
which is also calculated for these samples.

Table 2. Efficiency (%) of the αβ_Trimmed under normality with
respect to the corrected variance. Rows: α; columns: β.

 α \ β      0      5     10     15     20     25     30     35     40     45     50
    0     100  87.17  79.04  72.31  65.42  58.72  53.33  49.77  44.78  43.06  40.02
    5   99.97  87.08  78.69  71.84  64.98  58.53  53.11  48.25  44.76  42.98  39.66
   10   99.95  87.06  78.61  71.64  64.81  58.50  53.01  48.23  44.76  42.81  39.64
   15   99.94  86.97  78.55  71.50  64.75  58.46  52.96  48.23  44.74  42.63  39.62
   20   99.91  86.85  78.47  71.50  64.69  58.44  52.91  48.23  44.73  42.63  39.61
   25   99.86  86.77  78.41  71.49  64.66  58.35  52.80  48.22  44.72  42.49  39.35
   30   99.78  86.69  78.35  71.44  64.65  58.30  52.66  48.22  44.70  42.48  39.31
   35   99.82  86.66  78.14  71.27  64.62  58.26  52.59  48.16  44.68  42.48  39.26
   40   99.76  86.48  77.86  71.08  64.58  58.06  52.44  47.95  44.63  42.07  39.11
   45   99.51  86.38  77.56  70.90  64.40  57.91  52.18  47.54  44.39  42.01  38.85
   50   99.20  86.31  77.44  70.88  64.19  57.82  51.84  47.51  44.27  41.68  38.35

As a result of this simulation we have obtained Table 2, where it can be
noticed that the αβ_Trimmed family covers a wide range of efficiencies.
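
A scaled-down version of such a simulation can be sketched in Python. Everything here is an assumption made for illustration: the helper names, the much smaller design (400 samples of size 200 instead of 10000 of size 1000), the fixed seed, and the choice α = 0, β = 25 with the coefficient 1.8653508 taken from Table 1. The estimate is therefore noisy and should only land in the neighborhood of the corresponding Table 2 entry.

```python
import random
from statistics import median

def trimmed_mean(x, level):
    """Trimmed mean at the given level in [0, 50]; level 50 is the median."""
    if level == 50:
        return median(x)
    xs = sorted(x)
    n = len(xs)
    a = int(level * n / 100)
    return sum(xs[a:n - a]) / (n - 2 * a)

def alpha_beta_trim(x, alpha, beta, c=1.0):
    """Element of Definition 2.1; c stands in for C(alpha, beta)."""
    center = trimmed_mean(x, alpha)
    return c * trimmed_mean([(xi - center) ** 2 for xi in x], beta)

def sample_variance(v):
    """Corrected variance (divisor n - 1)."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

def simulated_efficiency(alpha, beta, c, n=200, reps=400, seed=0):
    """Variance of the classical estimator over variance of the trimmed one."""
    rng = random.Random(seed)
    classical, trimmed = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        classical.append(sample_variance(x))
        trimmed.append(alpha_beta_trim(x, alpha, beta, c))
    return sample_variance(classical) / sample_variance(trimmed)

eff = simulated_efficiency(0, 25, c=1.8653508)
print(f"estimated relative efficiency: {100 * eff:.1f}%")
```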



5. Final Comments 

We have defined the scale estimators family
$\sqrt{\alpha\beta\_\mathrm{Trimmed}}$ and analyzed the robustness and
efficiency properties of the elements in this family.

With regard to robustness, for a sample of size n with
$a = \mathrm{Int}(\alpha n/100)$ and $b = \mathrm{Int}(\beta n/100)$ for
$\alpha, \beta \in [0, 50]$, it has been proved that the estimators in this
class are affine equivariant, with exact fit point of value
$\mathrm{Min}\{(a + 1)/n, (b + 1)/n\}$, bounded sensitivity curve when
$\mathrm{Min}\{a, b\} \geq 1$, breakdown point equal to
$\mathrm{Min}\{(a + 1)/n, (b + 1)/n\}$ and corresponding asymptotic breakdown
point $\mathrm{Min}\{\alpha/100, \beta/100\}$.

On the other hand, the efficiency of the elements of this family (relative
efficiency with regard to the corrected variance) has been obtained.
Depending on the trimming levels, the efficiency of the αβ_Trimmed is between
38.35% and 100%.

To sum up, the αβ_Trimmed provides a family of spread estimators in which it
is possible to choose between certain efficiency levels and robustness
qualities. This is the main advantage of this family compared to other scale
estimators, such as the Mad, the Shorth or the ones defined in Rousseeuw and
Croux (1993), called $S_n$ and $Q_n$. Thus, depending on the case, or even on
the researcher’s preferences, it is possible to choose an element of the
αβ_Trimmed family which provides an adequate scale estimator.



Robust Efficient Method of
Moments Estimation 

C. Ortelli and F. Trojani 

Abstract. This paper focuses on the robust Efficient Method of Moments 
(EMM) estimation of a general parametric stationary process and proposes a 
broad framework for constructing robust EMM statistics in this context. We 
characterize the local robustness properties of EMM estimators for time series 
by computing the corresponding influence functions and propose two versions 
of a robust EMM (REMM) estimator with bounded influence function. We 
then show by Monte Carlo simulation that our REMM estimators are very 
successful in controlling for the asymptotic bias under model misspecification 
while maintaining a high efficiency under the ideal structural model. 

Mathematics Subject Classification (2000). 62F10, 62F35, 62G05, 62G35, 
62M09. 

Keywords. Efficient Method of Moments, indirect inference, influence func- 
tion, robust estimation, robust statistics. 



1. Introduction 

This paper analyzes the local robustness properties of estimators derived from
the Efficient Method of Moments (EMM, Gallant and Tauchen, 1996) and develops
a new class of robust statistics for the statistical analysis of parametric
models that are estimated within a general EMM framework. Specifically, we focus
on the robust EMM estimation of a general parametric stationary process where
the implied stationary density may not be computable analytically, and propose a
broad framework for constructing robust EMM statistics in this context.

Some authors have already addressed important issues related to robust
inference on time series models, starting from different perspectives. For
instance, some first definitions of an influence function (IF) for time series
(extending the basic definitions in Hampel’s (1974) seminal work) were developed
in Künsch (1984) and Martin and Yohai (1986). Künsch (1984) also derived the
optimality of a robust M-estimator for the parameters of a linear autoregressive
process with normally distributed error terms. In a similar vein, de Luna and
Genton (2001) more recently proposed a robust estimator for the parameters of a
linear ARMA process. In the nonlinear time series context Ronchetti and Trojani
(2001) introduced a robust version of a general Generalized Method of Moments
(GMM) statistic for situations where a reference model for the underlying data
distribution can be assumed. Finally, Genton and Ronchetti (2003) addressed, in
some simple model settings (using an indirect inference approach à la
Gouriéroux et al., 1993), the issue of robust estimation for models where the
stationary density of the structural model cannot be computed explicitly.

Both authors gratefully acknowledge the financial support of the Swiss National
Science Foundation (grant 12-65196.01 and NCCR FINRISK).

We address the problem of robust inference on general parametric models
for time series from a broad perspective that allows us to extend the
application field of robust procedures for time series essentially to all
models that can be estimated by a classical (nonrobust) EMM estimator. We
compute the time series influence function (Künsch, 1984) of a general EMM
estimator and propose two robust EMM estimators with bounded IF that ensure a
bounded asymptotic bias in neighborhoods of the given structural model. We then
present some Monte Carlo experiments attempting to quantify the trade-off
between robustness and efficiency in the REMM estimation of a highly nonlinear
model. To this end we estimate a structural ARMA(1,1)-ARCH(2) model by means of
an auxiliary AR(3)-ARCH(2) model. Both our REMM estimators yield very
satisfactory results in these experiments. Indeed, in our simulations the
efficiency loss implied by a REMM estimation under a perfectly specified
structural model is virtually negligible when compared with the results of the
classical EMM procedure. Further, we observe that even a quite moderate model
contamination can induce a very large bias and a strong loss in efficiency of a
classical EMM estimator, while both our REMM estimators are very successful in
bounding the induced asymptotic bias and efficiency loss. Finally, we find the
outlier identification procedure implied by the robust weights of our REMM
procedures to be very efficient in all our experiments.

The paper is organized as follows. Section 2 introduces the standard (nonro- 
bust) EMM setting. Section 3 computes the time series IF of an EMM estimator 
and introduces two REMM estimators that bound the implied IF in an appropriate 
metric. Section 4 presents some Monte Carlo experiments where the performance 
of our REMM estimators is evaluated in a nontrivial EMM setting. Section 5 
summarizes and concludes. 



2. Basic EMM Setting 

Let $X := \{X_t : \Omega \to \mathbb{R}^N,\ N \in \mathbb{N},\ t = 1, 2, \ldots\}$
be a strictly stationary and ergodic stochastic process on a complete
probability space $(\Omega, \mathfrak{F}, Q_0)$. The goal in EMM estimation is
to produce statistical inference on the probability $P_0 := Q_0 X^{-1}$ based on
a structural model
$\mathcal{P} = \{P_\rho : \rho \in \mathcal{R} \subset \mathbb{R}^J,\ J \in \mathbb{N}\}$
that defines for any structural parameter $\rho \in \mathcal{R}$ a probability
measure $P_\rho$ on the measurable space
$(\mathbb{R}^{N\infty}, \mathcal{B}(\mathbb{R}^{N\infty}))$. In classical EMM
estimation $\mathcal{P}$ is assumed to be correctly specified for $P_0$.

We denote by x'^ := {x[,X 2 ^ . . . the observed finite history of X'^ := 
(X(,X 2 , . . . ,X'^y and by the subvector consisting of the last m compo- 

nents of 0 < m < n. In EMM estimation the focus is on model settings where 
the (conditional) density of $X^{n}$ ($X_{n} \mid X^{n-1}$) cannot be expressed
analytically, so that direct estimation of $P_{\rho_0}$ is not feasible.
Instead, an auxiliary parametric model and an auxiliary parameter
$\theta \in \Theta \subset \mathbb{R}^{q}$, $q \geq 1$, are introduced, which
can be estimated for instance by pseudo maximum likelihood. This defines an
auxiliary estimation procedure that, in a first step of the EMM, estimates the
underlying probability $P_{\rho_0}$ in an approximate way. When a well-defined
binding function (see Gouriéroux et al., 1993) between the structural and
auxiliary parameters exists, an estimate of ρ can be recovered from the estimate
of θ by minimizing a quadratic form.

Often the auxiliary model is defined using a parametric family of pseudo- 
likelihood functions that induce a corresponding set of estimating equations, and 
we could do the same to develop our robust EMM (REMM) methodology. How- 
ever, it is convenient for our robust analysis and straightforward from a statistical 
perspective to work with a larger class of auxiliary models, which can be estimated 
by some M-estimator (Huber, 1964) defined by a general score function 

$\psi : \mathbb{R}^{NL} \times \Theta \to \mathbb{R}^q$.    (2.1)

To characterize the robustness of EMM statistics we need to write EMM estimators
and tests as functionals on a suitable space of distributions. For our purposes,
focusing on the finite dimensional distributions induced by the strictly stationary
distributions of the process, as in Künsch (1984), will be enough. Thus, define
for $L \ge 1$ the following set of finite-dimensional distributions

$\mathcal{M}^L_{stat} = \{ L\text{-dimensional marginals of strictly stationary processes} \}$.

Moreover, let for $L \le n$ the empirical $L$-dimensional marginal distribution of
$x^n$ be defined by

$P_n^L := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i^L}$, with $x_i := x_{i-n}$ for $i > n$,    (2.2)

where $\delta_{x_i^L}$ is the Dirac mass at $x_i^L \in \mathbb{R}^{NL}$. By construction $P_n^L \in \mathcal{M}^L_{stat}$.

To analyze the local robustness of statistical functionals derived from an
EMM estimator, the first step is to define a statistical functional $\theta(\cdot)$ for the
estimator of the auxiliary parameter $\theta$. By definition, $\theta(\cdot)$ is the functional solution
of the asymptotic estimating equations implied by the score function (2.1), that is

$\theta : \mathrm{dom}(\theta) \subset \mathcal{M}^L_{stat} \to \Theta, \quad P^L \mapsto \theta(P^L) := \theta_{P^L}$,

is the functional solution of the implicit equation

$E_{P^L}[\psi(X_1^L; \theta)] = 0$    (2.3)

in $\theta$. Notice that when restricting the auxiliary functional $\theta(\cdot)$ to the set
$\mathcal{P}^L := \{P_\rho^L : \rho \in \mathcal{R}\}$ it follows

$E_{P_\rho^L}[\psi(X_1^L; \theta_{P_\rho^L})] = 0$,    (2.4)

so that to any $\rho \in \mathcal{R}$ a unique $\theta_{P_\rho^L}$ can be associated, provided that (2.3) can be
uniquely solved for any $P^L \in \mathrm{dom}(\theta)$. Thus, in this case a well-defined binding
function mapping $\rho$ to $\theta_{P_\rho^L}$ is obtained.

Axiom 2.1. There exists a smooth injective binding function

$b_\psi : \mathcal{R} \to \Theta, \quad \rho \mapsto b_\psi(\rho) := \theta(\rho) := \theta_{P_\rho^L}$.

In the new notation we thus have $\theta_0 = \theta(\rho_0) = b_\psi(\rho_0)$. Furthermore, notice
that different auxiliary score functions $\psi$, $\psi'$ of the form (2.1), inducing possibly
different functionals $\theta$, $\theta'$, will produce different binding functions $b_\psi$, $b_{\psi'}$ if and
only if $\theta_{P_\rho^L} \neq \theta'_{P_\rho^L}$.

The pseudo-true value $\theta_0$ is estimated by $\hat\theta_n := \theta(P_n^L)$, which solves (2.3) with
respect to the empirical measure $P_n^L$. Under regularity conditions the auxiliary
estimator is consistent and asymptotically normal at the model.

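Proposition 2.2. Under regularity conditions on $(\hat\theta_n)_{n \in \mathbb{N}}$ (see, e.g., Gallant and
Tauchen, 1996) it follows

$\lim_{n \to \infty} \| \hat\theta_n - \theta_0 \| = 0$, a.s. $- P_0$,
$\sqrt{n}\,(\hat\theta_n - \theta_0) \to N(0, V_0)$ as $n \to \infty$,    (2.5)

where

$V_0 = A_0^{-1} B_0 (A_0^{-1})'$,    (2.6)

$A_0 = E_{P_0^L}\left[ \frac{\partial}{\partial\theta'} \psi(X_1^L; \theta) \Big|_{\theta = \theta_0} \right]$,    (2.7)

$B_0 = B_{0,0} + \sum_{\tau=1}^{\infty} B_{0,\tau} + \left( \sum_{\tau=1}^{\infty} B_{0,\tau} \right)'$,    (2.8)

and

$B_{0,\tau} = E_{P_0}\left[ \psi(X_1^L; \theta_0)\, \psi(X_{1+\tau}^L; \theta_0)' \right]$.    (2.9)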

The statistical functional $\rho(\cdot)$ for the EMM estimator of the structural model is
defined as the minimizer of a quadratic form in (2.4), given the value $\theta_{P^L}$ of the
auxiliary estimator, i.e.,

$\rho_{P^L} := \arg\min_{\rho \in \mathcal{R}} E_{P_\rho^L}[\psi(X_1^L; \theta_{P^L})]'\, S\, E_{P_\rho^L}[\psi(X_1^L; \theta_{P^L})]$,

where $S$ is some positive definite deterministic matrix. Thus, when denoting by
$m : \mathcal{R} \times \Theta \to \mathbb{R}^q$ the function given by

$m(\rho, \theta) := E_{P_\rho^L}(\psi(X_1^L; \theta))$,

the functional

$\rho : \mathrm{dom}(\rho) \subset \mathrm{dom}(\theta) \to \mathcal{R}, \quad P^L \mapsto \rho(P^L) := \rho_{P^L}$,

is the functional solution of the implicit equation

$m(\rho, \theta_{P^L})'\, S\, \frac{\partial}{\partial\rho'} m(\rho, \theta_{P^L}) = 0$,    (2.10)

in $\rho$. Under the given assumptions we have $\rho(P_0^L) = \rho_0$. Recall that we focus on
structural models where $P_\rho^L$ is not expressible in an analytical form; therefore the
expectation in (2.10) will have to be computed by Monte Carlo simulation, using
some simulated series $x^k(\rho)$, for $k$ sufficiently large.

The structural parameter is estimated by $\hat\rho_n := \rho(P_n^L)$, which solves (2.10)
with respect to the empirical measure $P_n^L$, and for some positive definite sequence
$S_n$ converging a.s. $- P_0$ to $S$. Specifically, $\hat\rho_n$ is such that

$m(\hat\rho_n, \hat\theta_n)'\, S_n\, \frac{\partial}{\partial\rho'} m(\hat\rho_n, \hat\theta_n) = 0$.

Weak convergence a.s. $- P_0$ of $P_n^L$ to $P_0^L$ implies with Proposition 2.2 and some
further regularity conditions on $m$ (see Gallant and Tauchen, 1996), that also the
sequence $(\hat\rho_n)_{n \in \mathbb{N}}$ is consistent and asymptotically normal at the model $P_0$.

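Proposition 2.3. Under regularity conditions on $(\hat\rho_n)_{n \in \mathbb{N}}$ it follows

$\lim_{n \to \infty} \| \hat\rho_n - \rho_0 \| = 0$, a.s. $- P_0$,
$\sqrt{n}\,(\hat\rho_n - \rho_0) \to N(0, \Sigma_S)$ as $n \to \infty$,

where

$\Sigma_S = (M_\rho' S M_\rho)^{-1} M_\rho' S M_\theta V_0 M_\theta' S M_\rho (M_\rho' S M_\rho)^{-1}$,    (2.11)

with

$M_\rho := \frac{\partial}{\partial\rho'} m(\rho, \theta) \Big|_{\rho = \rho_0, \theta = \theta_0}$ $(q \times l)$, $\quad M_\theta := \frac{\partial}{\partial\theta'} m(\rho, \theta) \Big|_{\rho = \rho_0, \theta = \theta_0}$ $(q \times q)$.    (2.12)

Standard regularity conditions allowing the interchange of integration and
differentiation in (2.12) imply $M_\theta = A_0$ (see 2.7), so that finally (2.11) reads

$\Sigma_S = (M_\rho' S M_\rho)^{-1} M_\rho' S B_0 S M_\rho (M_\rho' S M_\rho)^{-1}$.    (2.13)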

The next section analyzes the local robustness properties of EMM estimators
and statistics derived from EMM estimators.



3. Robust EMM Estimators

We denote by dom($T$) the domain of a statistical functional $T$. For an M-estimator
defined by a score function $\psi$ the IF is given by the standard expression (see also
Hampel et al., 1986, p. 230)

$IF(x^L; \theta, P_0^L) = -M_\theta^{-1} \psi(x^L; \theta_0)$.    (3.1)

The IF of the EMM functional $\rho$ is obtained by implicitly differentiating the EMM
first-order condition

$m(\rho(P^L), \theta(P^L))'\, S\, \frac{\partial}{\partial\rho'} m(\rho(P^L), \theta(P^L)) = 0$,    (3.2)

in direction $\delta_{x^L}$. We then have

$IF(x^L; \rho, P_0^L) = -(M_\rho' S M_\rho)^{-1} M_\rho' S M_\theta\, IF(x^L; \theta, P_0^L)$.    (3.3)

Thus, the IF of the EMM estimator depends linearly on the IF of the estimator of
the auxiliary parameters, implying that the IF of the EMM estimator is bounded
if and only if the estimator of the auxiliary parameter has a bounded IF. Finally,
replacing (3.1) in (3.3) it follows

$IF(x^L; \rho, P_0^L) = (M_\rho' S M_\rho)^{-1} M_\rho' S\, \psi(x^L; \theta_0)$.    (3.4)

Hence, for a $\xi \in \mathbb{R}^{NL}$ such that $\psi(\xi; \theta_0)$ is large, $\xi$ may have no influence on the
structural parameters if the vector $\psi(\xi; \theta_0)$ belongs to the kernel of $M_\rho' S$. Note
that the $l \times q$ matrix $M_\rho'$ has full row rank $l$ so that the dimension of its kernel is
equal to $q - l$.

We consider two robust EMM estimators with bounded IF that correct
simultaneously for two sources of asymptotic bias arising in the REMM setting: (i)
the standard bias arising in the EMM estimation of an auxiliary model and (ii)
the usual bias induced by a robust M-estimation of the auxiliary model via the
truncation of an unbounded auxiliary score function.

3.1. Bounding the IF 

Following Ronchetti and Trojani (2001), the construction of a REMM estimator
is performed by bounding the IF of $\rho$ with respect to the metric induced by its
variance-covariance matrix. Formally, we want the IF of $\rho$ to satisfy

$\| IF(x^L; \rho, P_0^L) \|_{\Sigma_S^{-1}} := \| \Sigma_S^{-1/2}\, IF(x^L; \rho, P_0^L) \| \le c$,    (3.5)

where $c$ is an a priori positive bound on the self-standardized sensitivity of $\rho$ (see
also Hampel et al., 1986, Chapter 4). By (3.4) and (2.13) this gives

$\| IF(x^L; \rho, P_0^L) \|^2_{\Sigma_S^{-1}} = \psi(x^L; \theta_0)' S M_\rho (M_\rho' S B_0 S M_\rho)^{-1} M_\rho' S\, \psi(x^L; \theta_0)$.    (3.6)

As a simple alternative, the properties of orthogonal projections for the matrix

$K := (M_\rho' S B_0 S M_\rho)^{-1/2} M_\rho' S B_0^{1/2}$

imply

$\| IF(x^L; \rho, P_0^L) \|_{\Sigma_S^{-1}} \le \| \psi(x^L; \theta_0) \|_{B_0^{-1}} = \| IF(x^L; \theta, P_0^L) \|_{V_0^{-1}}$.    (3.7)

The implications of this last result are very useful. Firstly, bounding the
self-standardized norm of the score of the auxiliary model is sufficient in order to
bound the self-standardized IF of the EMM estimator of the structural parameters.
Secondly, in contrast to $\| IF(x^L; \rho, P_0^L) \|_{\Sigma_S^{-1}}$, in order to bound $\psi$ in the metric
defined by $B_0^{-1}$ it is not necessary to compute numerically the matrix $M_\rho$.
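
The projection step behind (3.7) can be spelled out as follows. This is only a sketch of the reasoning, assuming that $S$ and $B_0$ are symmetric positive definite; the shorthand $\psi_0 := \psi(x^L; \theta_0)$ and $G := M_\rho' S B_0 S M_\rho$ is introduced here for convenience and is not the authors' notation.

```latex
\begin{align*}
K &= G^{-1/2} M_\rho' S B_0^{1/2},
\qquad
K K' = G^{-1/2} \left( M_\rho' S B_0 S M_\rho \right) G^{-1/2} = I_l , \\
K'K &= B_0^{1/2} S M_\rho \, G^{-1} M_\rho' S B_0^{1/2}
\quad \text{(symmetric and idempotent, hence an orthogonal projection)} , \\
\| IF(x^L;\rho,P_0^L) \|^2_{\Sigma_S^{-1}}
  &= \psi_0' S M_\rho \, G^{-1} M_\rho' S \psi_0
   = \left( B_0^{-1/2} \psi_0 \right)' K'K \left( B_0^{-1/2} \psi_0 \right) \\
  &\le \left( B_0^{-1/2} \psi_0 \right)' \left( B_0^{-1/2} \psi_0 \right)
   = \psi_0' B_0^{-1} \psi_0
   = \| \psi_0 \|^2_{B_0^{-1}} .
\end{align*}
```

The inequality uses only that an orthogonal projection cannot increase the Euclidean norm, which is why the bound can be evaluated without computing $M_\rho$.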

3.2. Definition of Robust EMM Estimators 

Construction of robust EMM estimators having a self-standardized IF bounded
by $c$ can be performed by truncating the function $\psi(x^L; \theta_0)$ with an appropriate
algorithm; see also Hampel et al. (1986), pp. 238 ff. We now present two procedures
by which this can be achieved. More details on the algorithms for computing our
REMM estimators are available in Ortelli and Trojani (2002). In both procedures
we replace the (unbounded) function $\psi$ of a classical (nonrobust) EMM estimator
by a new bounded one, denoted by $\psi_c^{A}$ and $\psi_c^{A,M}$, respectively. Computation of $M_\rho$
is required for $\psi_c^{A,M}$ but not for $\psi_c^{A}$, as the latter makes use of inequality (3.7). In the
sequel we refer to REMM1 for the robust estimator based on $\psi_c^{A}$ and to REMM2
for that based on $\psi_c^{A,M}$. Because of its simplicity, we start with the construction
of $\psi_c^{A}$.

For a non-singular matrix $A \in \mathbb{R}^{q \times q}$ we define the truncated auxiliary
score function

$\psi_c^{A}(x^L; \theta) := A \psi(x^L; \theta)\, w_c(A \psi(x^L; \theta))$,    (3.8)

where $w_c(x) = 1$ if $x = 0$, $w_c(x) = \min(1, c / \| x \|)$ otherwise, and $\| \cdot \|$ denotes
the Euclidean norm. The new binding function $b_{\psi_c^A} : \mathcal{R} \to \Theta$, $\rho \mapsto \theta = b_{\psi_c^A}(\rho)$,
and the functional estimator $\theta(\cdot)$ are now implicitly defined as the solution of the
nonlinear system of equations

$E_{P_\rho^L}[\psi_c^{A}(X_1^L; \theta)] = 0$ and $E_{P^L}[\psi_c^{A}(X_1^L; \theta)] = 0$,    (3.9)

respectively. The binding function $b_{\psi_c^A}$ is generally different from that implied by
the score function $\psi$. However, Fisher consistency of the structural parameters
is naturally maintained in the EMM framework because of the second step. The
non-singular matrix $A$ is determined by solving

$I = B_0 = E_{P_0}\left( \psi_c^{A}(X_1^L; \theta_0)\, \psi_c^{A}(X_1^L; \theta_0)' \right)$
$\qquad + \sum_{\tau=1}^{\infty} E_{P_0}\left( \psi_c^{A}(X_1^L; \theta_0)\, \psi_c^{A}(X_{1+\tau}^L; \theta_0)' \right)$
$\qquad + \sum_{\tau=1}^{\infty} E_{P_0}\left( \psi_c^{A}(X_{1+\tau}^L; \theta_0)\, \psi_c^{A}(X_1^L; \theta_0)' \right)$.    (3.10)

Equation (3.10) ensures $B_0 = I$ for the robust EMM estimator implied by the
score $\psi_c^{A}$ so that the self-standardized IF of $\rho$ is automatically bounded by $c$.

Construction of the second version of a robust EMM estimator is similar to
the first one. The truncated auxiliary score is defined as

$\psi_c^{A,M}(x^L; \theta) := \psi(x^L; \theta)\, w_c(A M_\rho' S\, \psi(x^L; \theta))$,    (3.11)

where $A \in \mathbb{R}^{l \times l}$ is non-singular. Here, the norm of $\psi$ is measured with respect
to the semi-metric induced by the matrix $S M_\rho A'A M_\rho' S$, which takes into account
the fact that we are interested in the robustness of the structural estimator and
not necessarily in the one of the auxiliary estimator. Assume that $l < q$. Because
rank($M_\rho$) $= l$ and rank($S$) $= q$, observations with large influence on the auxiliary
parameters may have small or even zero influence on the structural parameters and,
therefore, do not have to be downweighted. Furthermore, (3.5) is true whenever

$A'A = (M_\rho' S B_0 S M_\rho)^{-1}$    (3.12)

is satisfied. In fact, substituting (3.12) in (3.6) we obtain

$\| IF(x^L; \rho, P_0^L) \|^2_{\Sigma_S^{-1}} = \psi_c^{A,M}(x^L; \theta_0)' S M_\rho A'A M_\rho' S\, \psi_c^{A,M}(x^L; \theta_0) \le c^2$.
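
The truncation used in (3.8) and (3.11) can be illustrated with a small Python sketch. It is not the authors' implementation: the function names are hypothetical, the score value `psi_value` is assumed to be already evaluated at an observation, and the matrices `A`, `M_rho` and `S` are assumed to be given (in practice `A` has to be solved for, e.g. from (3.10) or (3.12)).

```python
import numpy as np


def huber_weight(v, c):
    """Weight w_c(v): 1 if v = 0, otherwise min(1, c / ||v||) (Euclidean norm)."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return 1.0
    return min(1.0, c / norm)


def truncated_score(psi_value, A, c):
    """First version, as in (3.8): A psi scaled by w_c(A psi)."""
    z = A @ psi_value
    return z * huber_weight(z, c)


def truncated_score_structural(psi_value, A, M_rho, S, c):
    """Second version, as in (3.11): psi scaled by w_c(A M_rho' S psi).

    M_rho is q x l, S is q x q and A is l x l (assumed shapes).
    """
    z = A @ M_rho.T @ S @ psi_value
    return psi_value * huber_weight(z, c)
```

In the second version a score vector lying in the kernel of $M_\rho' S$ receives weight one however large it is, which mirrors the remark that such observations need not be downweighted.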

4. Monte Carlo Simulations 

We extend the simple MA(1) model analyzed in Martin and Yohai (1986) to an
ARMA(1,1)-ARCH(2) process given by

$Y_t = 0.9\, Y_{t-1} + \epsilon_t - 0.5\, \epsilon_{t-1}$, $\quad \epsilon_t = \sqrt{h_t}\, u_t$, $\quad u_t$ iid $N(0,1)$,    (4.1)
$h_t = 0.04 + 0.6\, \epsilon_{t-1}^2 + 0.3\, \epsilon_{t-2}^2$.

We estimate the structural model (4.1) by means of an AR(3)-ARCH(2) auxiliary
model. Since we include a constant term in the estimation of the structural and
the auxiliary model the total number of auxiliary parameters $q$ is equal to 7 while
$l$ is equal to 6. For both robust estimators REMM1 and REMM2 the robustness
tuning constant $c$ is set equal to 10. Other levels of $c$ gave similar findings.

The upper panel of Figure 1 shows a simulated path from model (4.1) where
the observations have been contaminated with the replacement outlier model
defined by

$\tilde{Y}_t = (1 - H_t)\, Y_t + H_t\, \xi$,    (4.2)

where $H = (H_t)$ represents an iid 0-1 process independent of $Y$ and $P(H_t = 1) = \varepsilon$. We fixed
$\xi = 3$ and $\varepsilon = 1\%$. For this path only 4 observations have been replaced
(observations 180, 548, 660 and 797). Note the infrequent large positive and negative
movements which typically occur in periods of higher volatility. At first sight they
may easily be identified as outliers in a naive model misspecification analysis. Even
when knowing the form of the outlier-generating process, the exact identification of
the outliers in Figure 1 could be a very difficult task.

Figure 1. Contaminated series (top panel) and weights implied
by the REMM.
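
A minimal Python sketch of the simulation design may help to fix ideas. It assumes the standard ARCH link $\epsilon_t = \sqrt{h_t}\,u_t$ for model (4.1) and reads the replacement model (4.2) as "each observation is replaced by the constant $\xi$ with probability $\varepsilon$"; the function names, the burn-in length and the seed are choices made here, not taken from the paper.

```python
import numpy as np


def simulate_arma_arch(n, rng, burn_in=500):
    """Simulate Y_t = 0.9 Y_{t-1} + e_t - 0.5 e_{t-1} with ARCH(2) errors.

    e_t = sqrt(h_t) * u_t, h_t = 0.04 + 0.6 e_{t-1}^2 + 0.3 e_{t-2}^2, u_t iid N(0, 1).
    The first `burn_in` values are discarded to reduce the effect of the zero start.
    """
    total = n + burn_in
    u = rng.standard_normal(total)
    e = np.zeros(total)
    y = np.zeros(total)
    for t in range(2, total):
        h = 0.04 + 0.6 * e[t - 1] ** 2 + 0.3 * e[t - 2] ** 2
        e[t] = np.sqrt(h) * u[t]
        y[t] = 0.9 * y[t - 1] + e[t] - 0.5 * e[t - 1]
    return y[burn_in:]


def contaminate(y, xi, eps, rng):
    """Replacement outliers: replace each observation by xi with probability eps."""
    replaced = rng.random(y.shape[0]) < eps
    return np.where(replaced, xi, y), replaced


rng = np.random.default_rng(0)
y = simulate_arma_arch(800, rng)
y_cont, replaced = contaminate(y, xi=3.0, eps=0.01, rng=rng)
```

The boolean vector `replaced` records which observations were contaminated, so it can be compared afterwards with the observations that receive low robust weights.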



Table 1 summarizes the results obtained from 1000 uncontaminated simulations
of model (4.1) by presenting summary statistics and mean squared errors.
For each $\rho_i$ in Table 1 the first row contains the summary statistics for the
parameter estimates under a classical EMM estimation. The summary statistics for the
REMM1 and REMM2 estimators are given, respectively, in the second and the
third row for each $\rho_i$ in Table 1. The sample size behind all simulations is 800.



From Table 1 it is evident that the efficiency loss at the model when using
REMM is virtually negligible. Indeed, the mean squared errors of all parameter
estimates are very similar across the different EMM and REMM estimators. The
effects of the contamination on the resulting parameter estimates are summarized
in Table 2 with corresponding summary statistics. The results for the classical
EMM estimator in Table 2 highlight some large biases and mean squared errors
of a standard EMM model estimation. Indeed, we observe that the mean squared
error of all EMM parameter estimates in Table 2 is highly inflated by the presence
of contamination (compare for instance the mean squared errors for the estimates
of $\rho_1$ and $\rho_4$ given in Tables 1 and 2) and that some large biases are obtained
especially in the parameter estimates of the conditional variance equation. The
REMM procedures, on the other hand, are very successful in controlling both for
bias and efficiency in the presence of contamination. Indeed, when compared with
EMM, all mean squared errors in Table 2 are much smaller and the estimates in
the conditional variance equation present a quite reasonable bias.

Table 1. Summary statistics under the uncontaminated model (4.1).

True    Method    Mean      Median    Q25       Q75       Stdv      Q75 - Q25   MSE
 0      EMM        0.0005    0.0005   -0.0025    0.0036   0.0048    0.0061      2.33E-5
        REMM1      0.0008    0.0010   -0.0022    0.0039   0.0049    0.0061      2.43E-5
        REMM2      0.0008    0.0009   -0.0023    0.0037   0.0048    0.0060      2.38E-5
 0.9    EMM        0.8963    0.8978    0.8842    0.9098   0.0199    0.0255      4.10E-4
        REMM1      0.8954    0.8974    0.8837    0.9083   0.0196    0.0246      4.06E-4
        REMM2      0.8964    0.8985    0.8847    0.9088   0.0197    0.0241      4.02E-4
-0.5    EMM       -0.5036   -0.5033   -0.5408   -0.4653   0.0563    0.0755      3.19E-3
        REMM1     -0.5014   -0.5000   -0.5360   -0.4644   0.0556    0.0716      3.09E-3
        REMM2     -0.5054   -0.5034   -0.5401   -0.4697   0.0556    0.0704      3.11E-3
 0.04   EMM        0.0397    0.0393    0.0363    0.0429   0.0048    0.0066      2.33E-5
        REMM1      0.0400    0.0396    0.0365    0.0431   0.0048    0.0066      2.30E-5
        REMM2      0.0403    0.0399    0.0369    0.0434   0.0048    0.0065      2.33E-5
 0.6    EMM        0.5846    0.5844    0.5339    0.6389   0.0745    0.1050      5.78E-3
        REMM1      0.5900    0.5923    0.5379    0.6436   0.0761    0.1057      5.90E-3
        REMM2      0.5744    0.5746    0.5256    0.6257   0.0730    0.1001      5.98E-3
 0.3    EMM        0.2922    0.2923    0.2513    0.3332   0.0607    0.0818      3.75E-3
        REMM1      0.2959    0.2950    0.2539    0.3375   0.0617    0.0836      3.82E-3
        REMM2      0.2986    0.2992    0.2575    0.3414   0.0606    0.0839      3.67E-3

Finally, in all our simulations we have observed REMM to identify correctly
outliers when they were generated by the contaminating model used. To illustrate
this point Figure 1 presents again the contaminated path with the estimated
REMM weights (in the bottom panel). Weights clearly below one indicate an
influential observation and can be used to identify outlying observations or more
general model misspecifications. As shown in the bottom panel of Figure 1 all
generated outliers were clearly indicated with the lowest REMM weights. On the
other hand, even some very erratic model-generated movements, as for instance the
ones at observations 294, 480 and 614, were correctly identified as observations
generated by the underlying structural model.

Table 2. Summary statistics under model (4.1) contaminated by
the replacement model (4.2).

True    Method    Mean      Median    Q25       Q75       Stdv      Q75 - Q25   MSE
 0      EMM        0.0100    0.0091    0.0018    0.0172   0.0127    0.0154      2.59E-4
        REMM1      0.0009    0.0010   -0.0025    0.0045   0.0056    0.0071      3.17E-5
        REMM2      0.0010    0.0011   -0.0024    0.0046   0.0056    0.0070      3.28E-5
 0.9    EMM        0.8793    0.8852    0.8585    0.9096   0.0464    0.0511      2.58E-3
        REMM1      0.8861    0.8889    0.8730    0.9023   0.0225    0.0293      7.02E-4
        REMM2      0.8865    0.8895    0.8732    0.9031   0.0235    0.0299      7.33E-4
-0.5    EMM       -0.4913   -0.4877   -0.5484   -0.4266   0.0979    0.1218      9.65E-3
        REMM1     -0.4854   -0.4857   -0.5219   -0.4488   0.0541    0.0731      3.14E-3
        REMM2     -0.4855   -0.4856   -0.5216   -0.4495   0.0570    0.0721      3.46E-3
 0.04   EMM        0.1095    0.1023    0.0703    0.1422   0.0474    0.0719      7.08E-3
        REMM1      0.0482    0.0472    0.0424    0.0531   0.0088    0.0107      1.45E-4
        REMM2      0.0488    0.0480    0.0430    0.0537   0.0082    0.0106      1.44E-4
 0.6    EMM        0.5499    0.5473    0.4216    0.6672   0.1722    0.2456      3.21E-2
        REMM1      0.5996    0.5995    0.5375    0.6671   0.0925    0.1296      8.55E-3
        REMM2      0.5795    0.5825    0.5180    0.6490   0.0973    0.1310      9.88E-3
 0.3    EMM        0.2415    0.2187    0.1277    0.3254   0.1499    0.1977      2.59E-2
        REMM1      0.2554    0.2567    0.2003    0.3053   0.0748    0.1050      7.59E-3
        REMM2      0.2651    0.2648    0.2078    0.3174   0.0797    0.1096      7.57E-3



5. Conclusions

We characterized the local robustness properties of EMM estimators for time series
and proposed two versions of a REMM estimator with bounded IF. We verified in a
Monte Carlo simulation study that REMM estimators are successful in controlling
for the asymptotic bias under model misspecification while maintaining a high
efficiency under the ideal structural model. REMM extends the application field
of robust statistics to very general time series models and permits in a natural
way a robust estimation of non-Markovian processes.



References 

[1] X. de Luna and M.G. Genton, Robust Simulation-Based Estimation of ARMA Models.
J. Comput. Graph. Statist. 10 (2001), 370-387.

[2] A.R. Gallant and G. Tauchen, Which moments to match? Econometric Theory 12
(1996), 657-681.

[3] M.G. Genton and E. Ronchetti, Robust Indirect Inference. J. Amer. Statist. Assoc.
98 (2003), 67-76.

[4] C. Gouriéroux, A. Monfort, and E. Renault, Indirect Inference. J. Appl. Econom. 8
(1993), 85-118.

[5] F.R. Hampel, The Influence Curve and its Role in Robust Estimation. J. Amer.
Statist. Assoc. 68 (1974), 383-393.

[6] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel, Robust Statistics:
The Approach Based on Influence Functions. Wiley, New York, 1986.

[7] P.J. Huber, Robust Estimation of a Location Parameter. Ann. Math. Statist. 35
(1964), 73-101.

[8] H. Künsch, Infinitesimal Robustness for Autoregressive Processes. Ann. Statist. 12
(1984), 843-863.

[9] R.D. Martin and V.J. Yohai, Influence Functionals for Time Series. Ann. Statist.
14 (1986), 836-837.

[10] C. Ortelli and F. Trojani, Robust Efficient Method of Moments. Working paper,
University of Southern Switzerland, 2002, available at
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=335700.

[11] E. Ronchetti and F. Trojani, Robust Inference With GMM Estimators.
J. Econometrics 101 (2001), 37-69.

C. Ortelli
Institute of Finance
University of Lugano
Switzerland
e-mail: claudio.ortelli@lu.unisi.ch

F. Trojani
Institute of Finance
University of Lugano
and
Department of Economics
University of St. Gallen
Switzerland
e-mail: fabio.trojani@lu.unisi.ch




Statistics for Industry and Technology, 283-295 
© 2004 Birkhäuser Verlag Basel/Switzerland



Computational Geometry and 
Statistical Depth Measures 

E. Rafalin and D.L. Souvaine 



Abstract. The computational geometry community has long recognized that 
there are many important and challenging problems that lie at the interface of 
geometry and statistics (e.g., Shamos, 1976; Bentley and Shamos, 1977). The 
relatively new notion of data depth for non-parametric multivariate data anal- 
ysis is inherently geometric in nature, and therefore provides a fertile ground 
for expanded collaboration between the two communities. New developments 
and increased emphasis in the area of multivariate analysis heighten the need 
for new and efficient computational tools and for an enhanced partnership 
between statisticians and computational geometers. 

Over a decade ago point-line duality and combinatorial and computa- 
tional results on arrangements of lines contributed to the development of an 
efficient algorithm for two-dimensional computation of the LMS regression 
line (Souvaine and Steele, 1987; Edelsbrunner and Souvaine, 1990). The same 
principles and refinements of them are being used today for more efficient com- 
putation of data depth measures. These principles will be reviewed and their 
application to statistical problems such as the LMS regression line and the 
computation of the half-space depth contours will be presented. In addition, 
results of collaborations between computational geometers and statisticians 
on data-depth measures (such as half-space depth and simplicial depth) will 
be surveyed. 

Mathematics Subject Classification (2000). 68U05, 62-07. 

Keywords. Computational geometry, data depth, LMS regression, duality, 
half-space depth, simplicial depth. 



1. Introduction 

The field of Computational Geometry deals with the systematic study of algorithms
and data structures for geometric objects (de Berg et al., 1997). Computational
geometry usually focuses at the outset on exact algorithms that are asymptotically
fast. Yet once exact algorithms have been obtained, refined, and are still slow,
approximation algorithms of provable performance are sought. The field emerged
in the 1970's with the work of Michael Shamos (1978). Even in these early days the
computational geometry community recognized that there are many important
and challenging problems that lie at the interface of geometry and statistics (e.g.,
Shamos, 1976; Bentley and Shamos, 1977).

Partially supported by NSF grant EIA-99-96237.

The field is commonly related to problems in robotics, CAD/CAM and geographic
information systems. However, any problem that can be represented using
geometric objects and operators can be viewed as a computational geometry
problem, including the relatively new notion of data depth for non-parametric
multivariate data analysis. Data depth is inherently geometric in nature, and therefore
provides a fertile ground for expanded collaboration between the two communities.

A data depth measures how deep (or central) a given point $x \in \mathbb{R}^d$ is relative to
$F$, a probability distribution in $\mathbb{R}^d$, or relative to a given data cloud. Some examples
of data depth functions are Half-space Depth (Hodges, 1955; Tukey, 1975),
Majority Depth (Singh, 1993), Simplicial Depth (Liu, 1990), Oja Depth (Oja, 1983),
Convex Hull Peeling Depth (Barnett, 1976; Eddy, 1982) and Regression Depth
(Rousseeuw and Hubert, 1999). The data depth concept provides center-outward
orderings of points in Euclidean space of any dimension and leads to a new
non-parametric multivariate statistical analysis in which no distributional assumptions
are needed. Most depth functions are defined with respect to a probability
distribution $F$, considering $\{X_1, \ldots, X_n\}$ random observations from $F$. The finite sample
version of the depth function is obtained by replacing $F$ by $F_n$, the empirical
distribution of the sample $\{X_1, \ldots, X_n\}$. In general, computational geometers study
the finite sample case in which sets of points are investigated.

Over a decade ago, point-line duality and combinatorial and computational 
results on arrangements of lines contributed to the development of efficient algo- 
rithms for two-dimensional computation of the LMS regression line (Souvaine and 
Steele, 1987; Edelsbrunner and Souvaine, 1990). The same principles and refine- 
ments of them are being used today for more efficient computation of data depth 
measures. 

The technique of point-line duality, its application to LMS, and various sweep 
techniques will be sketched in Section 2. More recent results for data-depth related 
problems will be reviewed in Section 3 (including half-space depth and simplicial 
depth). Sections 4 and 5 will include suggestions for future collaboration and a 
summary. 



2. History 

2.1. The Duality Transform (Brown, 1980; Dobkin and Souvaine, 1987) 

The structure of a collection of lines is more readily observed than the structure
implied by a set of points. This structure can be exploited to create efficient
algorithms. A set of points can be transformed into an arrangement of lines using
a duality transform that preserves key properties (Brown, 1980). Many problems
based on sets of points have been solved efficiently by first transforming the points
into a set of lines, solving a related problem on the set of lines, and converting
the answer to a solution for the original problem. The transformation itself is
conducted in linear time, and consequently does not affect the total computational
complexity of the solution. Different transformations apply to different problems.
Here, we will focus on a particular transform suitable for the statistical problems
we treat. For more details see Dobkin and Souvaine (1987).

Figure 1. The duality transform $T$: The set of points and lines
on the left is transformed into the set on the right. The points
$A = (1, 2)$, $B = (2, 1)$, $C = (3, 0)$ and $D = (4, -1)$ lying on the
line $l : y = -x + 3$ are mapped to the lines $T(A) : y = x + 2$,
$T(B) : y = 2x + 1$, $T(C) : y = 3x$ and $T(D) : y = 4x - 1$, all
passing through the point $(1, 3)$, dual to the line $l$. The points
$A, B, C, D$ are located above $m : y = -2x - 2$, as are their dual
lines, above point $T(m) : (2, -2)$. In addition, the vertical distance
between every point $A, B, C, D$ to the line $m$ is identical to the
vertical distance between their dual lines and the dual point $T(m)$.

The duality transform $T$ (see Figure 1) maps a point $P = (a, b)$ to a line
$T(P) : y = ax + b$. We need to define the image of a line $l : y = cx + d$ under
transform $T$ consistently. Let us pick points $Q = (q, cq + d)$, $R = (r, cr + d)$ lying
on line $l$. Then $T(Q) : y = qx + (cq + d)$ and $T(R) : y = rx + (cr + d)$. Both of these
lines pass through the point $(-c, d)$. Every other point on $l$ will also be mapped
to a line passing through $(-c, d)$. Consequently we say that $T(l) : (-c, d)$.

Note that the slope of $l$ is preserved in the x-coordinate of the image point. If
the slope of line $l$ exceeds the slope of line $k$, then $T(l)$ lies to the left of $T(k)$. The
vertical distance of a point $P = (a, b)$ to a line $l : y = cx + d$ equals $b - (ca + d)$.
In the dual, the vertical distance from the line $T(P) : y = ax + b$ to the point
$T(l) = (-c, d)$ is $(-ac + b) - d$, the same value. Hence, the transformation $T$
preserves slope, vertical distance and the above/below relationship.
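
The properties above are easy to check numerically. The following Python sketch encodes a non-vertical line $y = cx + d$ as the pair `(c, d)`; the function names are hypothetical. The points $A$ to $D$ and the line $l : y = -x + 3$ are those of the Figure 1 example, while the test line `m` is an assumption, chosen here so that its dual is the point $(2, -2)$ named in the caption.

```python
def dual_of_point(a, b):
    """T maps the point (a, b) to the line y = a*x + b, returned as (slope, intercept)."""
    return (a, b)


def dual_of_line(c, d):
    """T maps the line y = c*x + d to the point (-c, d)."""
    return (-c, d)


def point_above_line(point, line):
    """Signed vertical distance b - (c*a + d) of the point (a, b) above the line y = c*x + d."""
    a, b = point
    c, d = line
    return b - (c * a + d)


def line_above_point(line, point):
    """Signed vertical distance of the line y = s*x + t above the point (px, py)."""
    s, t = line
    px, py = point
    return (s * px + t) - py


points = {"A": (1, 2), "B": (2, 1), "C": (3, 0), "D": (4, -1)}
l = (-1, 3)   # the line y = -x + 3 through A, B, C and D
m = (-2, -2)  # assumed test line y = -2x - 2, whose dual is the point (2, -2)

for name, p in points.items():
    primal = point_above_line(p, m)
    dual = line_above_point(dual_of_point(*p), dual_of_line(*m))
    print(name, primal, dual)  # the two signed distances agree for every point
```

Because the signed distance is preserved, its sign is preserved as well, which is the above/below relationship used in the sections that follow.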

For a vertical line $l : x = m$, $T$ maps the points of $l$ to a pencil of parallel lines of slope $m$.
We assume that these parallel lines intersect at a point $\infty_m$ (this point is called an
improper point). For every direction $m$, there is an improper point $\infty_m$ associated
with it. Conversely, each line $y = mx + a$ which passes through $\infty_m$ gets mapped
to a point $(-m, a)$ which lies on the line $x = -m$. Consequently, the image of the
improper point $\infty_m$ under $T$ is the vertical line $x = -m$.

The duality transform $T$ can be extended to higher dimensions: in $\mathbb{R}^3$ the
dual of the point $P = (a, b, c)$ is the plane $z = ax + by + c$.

2.2. Least Median of Squares Regression 



Figure 2. LMS regression: A set of points and its dual arrangement
(only some of the lines in the arrangement are drawn), and
the LMS slab and regression line. Line $l$ passes through points $A$
and $B$, and its dual point $T(l)$ is the intersection of lines $T(A)$
and $T(B)$. The slab created from line $l$ and the line parallel to
it passing through point $C$ contains 6 points. In the dual, the
vertical line segment between $T(l)$ and the line $T(C)$ crosses 6 lines.

Consider the problem of fitting a line to a set of data points. The familiar 
ordinary least squares (OLS) method minimizes the sum of the squares of the y- 
distance between the fitted line and the data points (the residual). This method 
has the disadvantage that a single corrupt point (outlier) can significantly perturb 
the fitted line. The Least Median of Squares (LMS) Regression line (Rousseeuw, 
1984) is the line that minimizes the median of the squares of the residuals. This 
method has a high breakdown point as up to 50% of the data points can be outliers, 
without perturbing the fitted line. 

It is easy to prove that the problem of finding the line $l$ that minimizes the
median residual is equivalent to the problem of finding the slab bounded by a
pair of parallel lines of minimum vertical separation that contains $\lceil n/2 \rceil$ of the data
points. Furthermore, one bounding line $L_1$ must pass through two points and the
other line $L_2$ through a point whose x-coordinate lies between those of $L_1$'s points.
To find the slab, one can check each pair of points $(p, q)$ and find a line $l$ parallel
to them, passing through some other point $r$, such that precisely $\lceil n/2 \rceil$ points lie
in the closed slab defined by pq and L This can be solved naively using an O(n^) 
time algorithm. 

Mapping the set of points using the transform $T$, the above problem becomes
one of looking for the intersection point of two lines $T(p)$ and $T(q)$ such that the
y-distance between this point and the line $T(r)$, which is above or below, is
the shortest among all such distances. By sweeping (see Section 2.3) a line through
the arrangement dual to the set of points, one can check all intersection points and
lines such that the closed vertical segment from the point to the line intersects $\lceil n/2 \rceil$
lines and is as short as possible.

Souvaine and Steele (1987) proposed an $O(n^2 \log n)$ time algorithm for
computing the LMS line in $\mathbb{R}^2$ using the duality concept described above. Their result
was improved by Edelsbrunner and Souvaine (1990) to an $O(n^2)$ time algorithm,
by using a topological line sweep (see Section 2.3). A practical approximation
algorithm for 2 and higher dimensions, using additional computational geometry
concepts, was recently presented by Mount et al. (1997).
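
For small data sets the slab characterization can be turned directly into a brute-force search, sketched below in Python. This is not one of the sweep-based algorithms cited above: it simply tries the slope of every pair of points, sorts the vertical offsets, and keeps the narrowest window covering $\lceil n/2 \rceil$ points, which costs roughly $n^3 \log n$ operations. The function name and the return convention are choices made here.

```python
from itertools import combinations
from math import ceil


def lms_line_bruteforce(points):
    """Brute-force LMS fit in the plane.

    Returns (slope, intercept, width), where width is the vertical width of the
    narrowest slab with a pairwise slope that contains ceil(n/2) points, and the
    fitted line runs through the middle of that slab.
    """
    n = len(points)
    h = ceil(n / 2)
    best = None  # (width, slope, intercept)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical pair does not define a usable slope
        slope = (y2 - y1) / (x2 - x1)
        # Vertical offsets of all points relative to a line of this slope.
        offsets = sorted(y - slope * x for x, y in points)
        for k in range(n - h + 1):
            width = offsets[k + h - 1] - offsets[k]
            if best is None or width < best[0]:
                intercept = (offsets[k] + offsets[k + h - 1]) / 2
                best = (width, slope, intercept)
    if best is None:
        raise ValueError("need at least two points with distinct x-coordinates")
    width, slope, intercept = best
    return slope, intercept, width


# Five points on y = 2x + 1 plus three gross outliers.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 0), (6, 30), (7, -5)]
slope, intercept, width = lms_line_bruteforce(data)
```

Restricting the search to pairwise slopes relies on the property stated in the text that one bounding line of the optimal slab passes through two data points.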

2.3. Line Sweep, Topological Line Sweep and Levels in an Arrangement 

Vertical line sweep (Bentley and Ottmann, 1979) is a classical technique in
computational geometry. The algorithm sweeps a vertical line across an arrangement
of objects (points, lines, segments, ...) from left to right, reporting all intersection
points, in a series of elementary steps. For an arrangement of $n$ line segments
with $k$ intersection points this technique achieves a time complexity of
$O((n + k) \log n)$ and requires $O(n)$ space (Brown, 1981). For an arrangement of $n$
infinite lines that contains $k = \binom{n}{2} = O(n^2)$ intersections, vertical line sweep
could report all intersection pairs sorted in order of x-coordinate in $\Theta(n^2 \log n)$
time and $O(n)$ space. If one needs to report the intersection points of the lines
according only to a partial order related to the levels in the arrangement, greater
efficiency is possible using topological sweep (Edelsbrunner and Guibas, 1989). In
topological line sweep, to report all the intersection points of the lines, a topological
line, which is monotonic in the y-direction but not necessarily straight, and
which intersects each of the $n$ lines in the arrangement exactly once, sweeps the
arrangement, in a series of elementary steps, in $O(n^2)$ time and $O(n)$ space.

The kth level of an arrangement of lines is the set of points that lie on lines
and have at most k - 1 lines above them and at most n - k lines below them
(see Figure 3). Consider an arrangement that was created from a set of points,
using the duality transform T. If a point T(p), the image of line p, that lies on
line l = T(L), is in the kth level, then line p has at most k - 1 points above it
and at most n - k points below it. This property will be used in Section 3.1.1, to
compute half-space depth contours.
3. Current Applications

3.1. Half-space Depth and Half-space Depth Contours 

The half-space depth (in the literature this is sometimes called location depth or
Tukey depth) of a point x relative to a set of points S = {X_1, ..., X_n} is the
minimum number of points of S lying in any closed half-space determined by a
line through x (Hodges, 1955; Tukey, 1975).
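
The definition can be evaluated directly in the plane. The following is a minimal
sketch, not one of the algorithms cited in this paper: it runs in O(n^2) time by
testing one half-plane direction between each pair of consecutive critical
directions (directions at which some data point lies on the boundary line). The
function name and the tolerance eps are assumptions made for illustration.

```python
import math

def halfspace_depth(x, points, eps=1e-12):
    """Minimum number of points in any closed half-plane bounded by a line through x."""
    px, py = x
    # Points that coincide with x lie in every closed half-plane through x.
    coincident = sum(1 for (a, b) in points if (a, b) == (px, py))
    vecs = [(a - px, b - py) for (a, b) in points if (a, b) != (px, py)]
    if not vecs:
        return coincident
    # A normal direction is critical when it is perpendicular to some p - x.
    crit = []
    for dx, dy in vecs:
        t = math.atan2(dy, dx)
        crit.append((t + math.pi / 2) % (2 * math.pi))
        crit.append((t - math.pi / 2) % (2 * math.pi))
    crit.sort()
    best = len(vecs)
    m = len(crit)
    for i in range(m):
        lo = crit[i]
        hi = crit[(i + 1) % m] + (2 * math.pi if i == m - 1 else 0.0)
        if hi - lo <= eps:
            continue  # skip (near-)duplicate critical directions
        mid = 0.5 * (lo + hi)
        ux, uy = math.cos(mid), math.sin(mid)
        # Count the points in the closed half-plane {p : (p - x) . u >= 0}.
        count = sum(1 for dx, dy in vecs if dx * ux + dy * uy >= 0)
        best = min(best, count)
    return best + coincident

square = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
depth_center = halfspace_depth((0, 0), square)
depth_outside = halfspace_depth((5, 5), square)
depth_vertex = halfspace_depth((1, 1), square)
```

Testing only the open intervals between critical directions suffices because the
closed half-plane count can only drop when the boundary line is rotated slightly
away from a data point.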

Using the duality transform T, the set of points S maps to an arrangement
of lines T(S). The number k of points of S above (resp. below) a line L through
x equals the number of lines of T(S) above (resp. below) the dual point T(L).
The depth of a point x relative to S is the minimum number of points of S lying
in any closed half-space determined by a line through x (above or below the line
through x). Consequently, the depth of a line T(x), dual to the point x, relative
to T(S), is the minimal number of lines of T(S) above or below T(x).
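
The counting property can be checked numerically. The sketch below assumes one
common order-preserving convention for T, which may differ from the one defined
earlier in this paper: the point (a, b) maps to the line y = -a x + b, and the
non-vertical line y = m x + c maps to the point (m, c). All function names are
illustrative.

```python
def dual_line(point):
    """Assumed convention: point (a, b) -> line y = -a*x + b, as (slope, intercept)."""
    a, b = point
    return (-a, b)

def dual_point(line):
    """Assumed convention: line y = m*x + c, given as (m, c) -> point (m, c)."""
    m, c = line
    return (m, c)

def points_above(line, points):
    """Number of points strictly above the line y = m*x + c."""
    m, c = line
    return sum(1 for (a, b) in points if b > m * a + c)

def lines_above(point, lines):
    """Number of lines (slope, intercept) passing strictly above the point."""
    x, y = point
    return sum(1 for (s, t) in lines if s * x + t > y)

S = [(0, 0), (1, 3), (2, 1), (3, 5), (4, 2)]
L = (1, 0.5)                                  # the line y = x + 0.5
primal_count = points_above(L, S)
dual_count = lines_above(dual_point(L), [dual_line(p) for p in S])
level = dual_count + 1                        # T(L) would lie on level k = count + 1
```

Under this convention, (a, b) lies above y = m x + c exactly when b > m a + c,
which is the same inequality as the dual line passing above the point (m, c).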




Figure 3. Levels in an arrangement and half-space depth computation:
The set of 7 points A to G on the left is transformed using
the duality transform T into the arrangement of lines T(A) to T(G)
on the right. The levels 1, 2, 6, 7 are drawn in blue, green, yellow
and red, respectively. Any vertical line cutting the kth level will
pass through exactly k - 1 lines of the arrangement above it. The
intersection points on the 1st and 7th level are candidate points
for the first half-space depth contour (as are the 2nd and 6th
levels for the second depth contour). Therefore transforming the
intersection points on each level back to the primal plane (as lines)
and computing their intersection will result in that depth contour.
The 1st and 2nd contours are drawn as grayed areas.

3.1.1. Efficient computation of Depth Contours using Duality. The kth depth
contour for a set of points S in R^d is the boundary of the set of points of R^d
with depth ≥ k, according to the chosen depth function.

The kth depth contours for half-space depth in R^2 can be constructed by
taking the intersection of all half-planes containing k points of S. Specifically, the
significant half-planes are the ones bounded by lines connecting pairs of points of
S. Every line L passing through a pair of points l, m is dualized to an intersection
point I of two lines T(l), T(m). The vertical line through I in the dual intersects
every line of the arrangement, some above I and some below. The number of
lines intersected above (resp. below) I equals the number of points lying above
(resp. below) T^{-1}(I) = L in the primal. The contour to which L contributes is
the minimum of these two numbers (this is also said to be the level of L, see
Section 2.3). Examining all intersection points in the arrangement of dual lines
will produce a list of half-planes/intersections for each potential contour.

Miller et al. (2003) presented an O(n^2) time algorithm for computing all
depth contours for a planar data set, using the idea sketched above. The authors
used topological sweep (see Section 2.3) to examine all intersection points and to
report the level of each intersection point in optimal time and space.

Note that the duality concept and the algorithmic idea can be easily extended
to higher dimensions. For example, in R^d the half-space depth of a point x is the
minimum number of points of a given set S lying in any closed half-space bounded
by a hyperplane (instead of a line, as in R^2) through x.

3.1.2. Overview of Related Results. The half-space depth of a single point can
be computed in O(n log n) time as shown by Rousseeuw and Ruts (1996). This
matches the lower bound that was recently proved by Aloupis et al. (2002).

Algorithms for computing the deepest point relative to the data set (also
known as the Tukey median) were introduced by Rousseeuw and Ruts. In Ruts and
Rousseeuw (1996), they give an O(n^2 log n) time implementation to compute a
single contour. In Rousseeuw and Ruts (1998), the same technique is applied
repeatedly to give an O(n^2 log^2 n) time implementation to compute the
two-dimensional Tukey median. A fast implementation to approximate the deepest
location in high dimensions is given in (Rousseeuw and Struyf, 1998; Struyf and
Rousseeuw, 2000; Verbarg, 1997). Theoretical complexity results on two-dimensional
depth contours, or k-hulls, have been known for some time (Cole et al., 1987). The
best known theoretical result for computing the 2D Tukey median is an
O(n log^5 n) algorithm by Matousek (1991), but its complex structure makes it an
impractical candidate for implementation. The improved O(n log^3 n) algorithm by
Langerman and Steiger (2003) uses parametric search, which is also difficult to
implement.

Ruts and Rousseeuw (1996) developed a program called ISODEPTH, that
computes the kth contour in O(n^2 log n) time and all contours in O(n^3 log n).
Johnson et al. (1998) gave a program called FDC to compute the k outermost depth
contours which outperforms ISODEPTH for small k. The O(n^2) time implementation
by Miller et al. (2003) that was described above computes all the depth
contours, the depth of all the data points, and the Tukey median. The implementation
was expanded by Rafalin et al. (2002) to handle degenerate data sets, that
contain 3 or more points on a line or points that share the same x-coordinate.
A different approach based on parallel arrangement construction by Fukuda and
Rosta (2003) allows the computation of high dimensional depth contours. In addition,
half-space depth contours can be computed for display in 2D using hardware assisted
computation, suggested by Krishnan et al. (2002).

Centerpoints, which are points of depth ≥ n/(d + 1) (by Helly's theorem
(Edelsbrunner, 1987), such points are guaranteed to exist), have been widely
studied in the computational geometry literature (Clarkson et al., 1996; Jadhav and
Mukhopadhyay, 1994; Matousek, 1991; Naor and Sharir, 1990). There are efficient
algorithms for centerpoints in R^2 by Matousek (1991) and in linear time by Jadhav
and Mukhopadhyay (1994). Naor and Sharir (1990) give an algorithm in R^3. The
best known time in high dimensions is O(n^{d+1}), but a faster approximation
algorithm is given by Clarkson et al. (1996). The algorithms are difficult to implement.

3.2. Simplicial Depth 

Simplicial depth was introduced by Liu (1990): The simplicial depth of a point
x with respect to a data set S = {X_1, ..., X_n} in R^d is the fraction of the
closed simplices formed by d + 1 points of S containing x,

    SD_{Liu}(S; x) = \binom{n}{d+1}^{-1} \sum I(x \in S[X_{i_1}, \ldots, X_{i_{d+1}}]),

where I is the indicator function and the sum runs over all subsets of d + 1
points of S (see Figure 4(a)).

A simplex in R^d is the boundary of the region defined by d + 1 vertices in
general position, and represents the simplest possible polytope in any dimension.
In one dimension, a simplex is simply a line segment; in 2 dimensions, it is a
triangle; and in 3 dimensions, a tetrahedron. Simplicial depth has been studied by
computational geometers since it was first introduced (Khuller and Mitchell, 1990;
Gil et al., 1992). The simplicial depth is computationally more difficult in the finite
sample case than some other depth measures. Recently new results and concepts
were discovered by computational geometers, some discussed below.

3.2.1. A Revised Definition. Several problems arise in the finite sample case of
simplicial depth under Liu’s definition. As suggested by Zuo and Serfling (2000), a
depth function should attain maximum value at the center (the maximality property)
and as a point x moves away from the deepest point along any fixed ray
through the center, the depth at x should decrease monotonically (the monotonicity
property). However, the simplicial depth function for the finite sample case fails
to satisfy these properties (Zuo and Serfling, 2000) (see, e.g., Figure 4(b)). In
addition, the depth of points on facets causes discontinuities in the depth function:
The depth of all points on the boundary of a cell is at least the depth of a point
on the interior of the cell. In most cases the depth values on the boundaries can
be higher than the depth in each of the adjacent cells (see, e.g., Figure 4(b)).

Burr et al. (2003) proposed a modification to the definition that corrects 
the irregularity at boundaries of simplices by making the depth of a point on the 
boundary between two cells the average of the depth of the two cells and fixes the 
counterexamples provided by Zuo and Serfling (2000): 

Given a data set S = {X_1, ..., X_n} in R^d, the simplicial depth of a point x is the
average of the fraction of closed simplices containing x and the fraction of open
simplices containing x:

    SD_{BRS}(S; x) = \frac{1}{2} \binom{n}{d+1}^{-1} \sum \left( I(x \in S[X_{i_1}, \ldots, X_{i_{d+1}}]) + I(x \in \mathrm{int}(S[X_{i_1}, \ldots, X_{i_{d+1}}])) \right)

where int refers to the open relative interior (see Edelsbrunner, 1987, page 401)
of S[X_{i_1}, ..., X_{i_{d+1}}]. Equivalently, this could be formulated as:
SD_BRS(S; x) = ρ(S, x) + (1/2) σ(S, x), where ρ(S, x) is the number of simplices
with data points as vertices which contain x in their open interior, and σ(S, x)
is the number of simplices with data points as vertices which contain x in their
boundary.

Figure 4. (a) Computation of simplicial depth, according to
Liu and according to Burr et al. SD_Liu(x_1) = SD_BRS(x_1) = 1,
SD_Liu(x_2) = 1, SD_BRS(x_2) = .5, SD_Liu(x_3) = SD_BRS(x_3) = 0.
(b) Problems with simplicial depth: The total number of simplices
is \binom{5}{3} = 10. The depth of the open regions is drawn
on the figure. SD_Liu(p) = \binom{4}{2}/10 = .6 for p = {A, B, C, D},
while SD_Liu(E) = .8. For a point x on AE, SD_Liu(x) = .5,
violating the monotonicity property and causing discontinuities
in the depth function at edges. According to the revised definition
SD_BRS(p) = .3 for p = {A, B, C, D}, SD_BRS(E) = .5 and
SD_BRS(x) = .35. (c) A problem with the revised definition. The
data points A, B, and C all have depth (241 + ^ (2^))/ (3^) —
and the data point D, which is at the unique center of the data
set has depth (5^ -h ^ (2^))/ (3^) = For clarity reasons not all
cells are drawn.
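
The two definitions can be compared with a brute-force count over all triangles
in the plane. This is only an illustrative O(n^3) sketch assuming points in
general position with exactly representable coordinates; it is not one of the
efficient algorithms surveyed in Section 3.2.2, and the function names are made up
for the example.

```python
from itertools import combinations

def orient(p, q, r):
    """Twice the signed area of triangle pqr (positive if counterclockwise)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def simplicial_depths(x, points):
    """Return (SD_Liu, SD_BRS) of x in the plane by enumerating all triangles."""
    closed = interior = total = 0
    for a, b, c in combinations(points, 3):
        if orient(a, b, c) == 0:
            continue  # collinear triple: not a simplex (general position assumed)
        total += 1
        signs = [orient(a, b, x), orient(b, c, x), orient(c, a, x)]
        has_neg = any(s < 0 for s in signs)
        has_pos = any(s > 0 for s in signs)
        if not (has_neg and has_pos):
            closed += 1                   # x lies in the closed triangle
            if all(s != 0 for s in signs):
                interior += 1             # x lies in the open interior
    if total == 0:
        raise ValueError("need at least one non-degenerate triangle")
    sd_liu = closed / total
    sd_brs = (closed + interior) / (2 * total)
    return sd_liu, sd_brs

triangle = [(0, 0), (4, 0), (0, 4)]
inside = simplicial_depths((1, 1), triangle)
on_edge = simplicial_depths((2, 0), triangle)
outside = simplicial_depths((5, 5), triangle)
```

With a single triangle, the three query points reproduce the pattern of values
listed for Figure 4(a): the two definitions agree inside and outside the
triangle and differ only on its boundary.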

The revised definition reduces to the original definition for continuous distri- 
butions and for points lying in the interior of cells, and it maintains the ranking 
order of data points. In addition, it can be calculated using the existing algorithms 
(see Section 3.2.2), with slight modifications. However, it does not achieve all de- 
sired properties in the sample case. Figure 4(c) shows an example where the data 
set has a unique center, D, but it neither attains maximality at the center, nor 
does it have monotonicity relative to the deepest point. 

In addition, data points are still over-counted. For example, data points in
R^2 are a vertex of \binom{n-1}{2} simplices, whereas edges are counted only
n - 2 times. This implies that the weight λ of a data point should satisfy
(1/2)(n - 2) = λ \binom{n-1}{2}, that is, λ = 1/(n - 1).
However, this factor isn’t enough: consider a data set of n points, where n - 1
points are evenly distributed angularly around one point. Then the depth of the
center point should be at least as large as the depth of the n cells which use it as
a vertex; this is not guaranteed by the λ factor. Thus the depth of a data point
in R^2 should be based both on the λ factor and the geometry of the n - 1 other
data points.

3.2.2. Overview of Related Results. The simplicial depth of a point in R^2 can
be computed in O(n log n) time, as proposed independently by Gil et al. (1992),
Khuller and Mitchell (1990), and Rousseeuw and Ruts (1996). This bound matches
the lower bound, as proved by Gil et al. (1992) and Aloupis et al. (2002). In 3
dimensions, Gil et al. (1992) gave an O(n^2) algorithm for computing the simplicial
depth of x relative to S that was recently improved by Cheng and Ouyang (2001).
Cheng and Ouyang also give a generalization of this algorithm to R^4, with a time
complexity of O(n^4). Rousseeuw and Ruts (1996) proposed an O(n^2) algorithm
for R^3, but some missing details may not be resolvable (Cheng and Ouyang, 2001).
For space higher than 4 dimensional, there are no known algorithms faster than
the straightforward O(n^{d+1}) method (generate all simplices and count the number
of containments). The depth of all n points in R^2 can be computed in O(n^2) using
the duality transform (Gil et al., 1992; Khuller and Mitchell, 1990). The simplicial
median is the point with the highest simplicial depth. Aloupis et al. (2003) showed
that in 2 dimensions it can be computed in O(n^4). Their method was slightly
improved by Burr et al. (2003).

No known algorithm exists for the computation of simplicial depth contours, 
apart from the straightforward one. Recently an approximation method, using 
local information about the depth function (using a discretized version of the 
gradient, the vector where simplicial depth of positions is increasing most rapidly) 
was proposed by Burr et al. (2003). 



4. Future Work 

Collaborations between computational geometers and statisticians on data-depth 
measures have produced more efficient algorithms and implementations with sig- 
nificant speedup. Some problems are solved in theory by algorithms with efficient 
asymptotic running times (see Sections 3.1.2, 3.2.2) where, in practice, the hidden 
constants make implementations infeasible, prompting the need for continued re- 
search and implementable solutions. In addition, most of the algorithms and their 
implementations work for the low dimensional case. High dimensional data sets 
pose more interesting research questions. Exact computation of the depth func- 
tions will probably never be efficient enough, as the complexity of the algorithm 
is in most cases exponential in the dimension of the data. This calls for more 
approximation algorithms, that can compute depth functions and depth contours 
efficiently, with provable error bounds. 
5. Summary

The computational geometry community is thirsty for new and challenging open 
problems with real applications; the statistical community clearly needs improved 
computation speed in order to handle the large data sets of today. Much is to be 
gained by increased collaboration and solutions that are not only provably efficient 
but also effective in practice. 
Abstract. The covariance matrix is a key component of many multivariate
robust procedures, whether or not the data are assumed to be Gaussian. We 
examine the idea of robustly fitting a mixture of multivariate Gaussian densi- 
ties in the situation when the number of components estimated is intentionally 
too few. Using a minimum distance criterion, we show how useful results may 
be obtained in practice. Application areas are numerous, and examples will 
be provided. 
1. Introduction

The problem of identifying and handling outliers is an important problem in data 
analysis; see Barnett and Lewis (1994) and Huber (1981), for example. Finding 
outliers through interactive graphical exploration of small multivariate sets can 
be accomplished using Swayne, Cook, and Buja’s XGobi system (1998); however, 
with very large medium-dimensional data sets nonparametric density algorithms 
such as Scott’s ASH (1992) can be recommended. Of course, purely graphical and 
nonparametric approaches to identifying outliers are prone to errors since many 
smooth densities give rise to data that may appear to have outliers, but do not. 
The Cauchy distribution is one example of such a density. 

Thus the identification of outliers without an explicit probability model 
should always be viewed as preliminary and exploratory. If a probability model is 
known, then the tasks of parameter estimation and outlier identification can be 
more rigorously defined. However, even probability models are usually known only 
approximately at best, and hence outliers so identified are still subject to certain
biases. 

The assumption of a multivariate normal model is most common. Often data 
may be transformed so that normality holds approximately. Successful robust es- 
timation of the mean and covariance allows one to tag outliers more than three 
or four standard units from the mean. Finding the correct shape of the covariance 
matrix is the more challenging and interesting task. This shape can be sought 
without assuming normality. The minimum volume ellipse (MVE) has been inves- 
tigated (Rousseeuw, 1985; Rousseeuw and Leroy, 1987; Poston et al., 1997; Cook 
et al., 1993; Woodruff and Rocke, 1994). Finding the MVE exactly is a challeng- 
ing combinatorial problem but reasonably good approximate solutions exist. The 
MVE approach is especially appealing when the number of outliers is nontrivial 
and/or the outliers themselves form clusters. 

Another approach that can be quite successful in this setting was proposed
by Aitkin and Wilson (1992), who investigated fitting a Gaussian mixture model
for x ∈ R^p

    f(x) = \sum_{k=1}^{K} w_k \, \phi(x | \mu_k, \Sigma_k)        (1.1)

using the expectation-maximization (EM) algorithm (Dempster et al., 1980). This
model is often used for other purposes such as cluster analysis (McLachlan and
Peel, 2001). Here we treat the outliers as members of “nuisance” clusters. Once
successful estimation is achieved, we re-order the labels so that the cluster weights
are in decreasing magnitude, w_1 > w_2 > ... > w_K. Then in many situations, we
may view the smallest K - 1 clusters as representing various kinds of outliers and
outlying clusters, with (w_1, μ_1, Σ_1) being the parameters of interest. For example,
the fraction of outliers would be estimated as being 1 - w_1.

Getting the EM algorithm to work requires a number of steps. Choosing 
K is especially tricky since many of the outliers may be singletons. Even more 
difficult is getting good initial guesses for the mixture parameters, especially when 
K is much larger or smaller than the “correct” model. Finally, small clusters 
cannot support estimation of a full covariance matrix, so that special structure 
may be assumed, such as a diagonal form or pooling of covariance information. 
Nevertheless, as a conceptual framework for handling large numbers of outliers in 
a range of challenging situations, the mixture model has great intuitive appeal. 

A practical concern with the Aitkin and Wilson approach is that while a 
normality assumption may make good sense for the “good” data, knowledge of 
the distribution of the outliers and outlier clusters is more suspect. Since the 
primary parameters of interest are (w_1, μ_1, Σ_1), a procedure which estimates only
that component would be of great interest. In one sense, the MVE approach is 
capable of this task. In this paper, we propose an alternative estimation approach 
to EM that can also estimate only a subfraction of the components of the mixture 
model. Applications are numerous. 
2. Mixture Estimation by Minimum Distance



Practical algorithms for estimating parameters in a model f(x|θ) by minimum
distance have been discussed by Beran (1977), for example, by minimizing Hellinger
distance between the model and a kernel density estimate. Such an indirect ap- 
proach is not feasible for a parameter-rich model such as the mixture model (Scott, 
1999). One distance criterion, integrated squared error (ISE), affords better nu- 
merical properties. ISE has been the criterion of choice for nonparametric function 
estimation (Scott, 1992), and Rudemo (1982) showed how to formulate a cross- 
validated estimate of ISE for histograms using a leave-one-out approach. Terrell 
(1990) first proposed using the ISE as an alternative to maximum likelihood, with 
extensions by his student Kim (1995). The use of ISE for parametric estimation has 
been described (Hjort, 1994; Scott, 1999; Scott and Szewczyk, 2001; Wojciechowski 
and Scott, 1999; Scott, 2001). A more general divergence criterion (which includes 
ISE) has been described by Basu et al. (1998). Mixture estimation by ISE has 
been discussed in Scott (1999). 

In the usual presentation, we seek to find the true parameter, θ_0, from the
parametric model, f(x|θ). In our case, our estimate will be of the form f(x|θ̂), but
the true density, g(x), will not (necessarily) be from the parametric family, f(x|θ).
Nevertheless, we seek to find the value of θ such that f(x|θ) is closest to g(x) in
the sense of integrated squared error.


    \hat\theta = \arg\min_\theta \left[ \int (f(x|\theta) - g(x))^2 \, dx \right]
               = \arg\min_\theta \left[ \int f(x|\theta)^2 \, dx - 2 \int f(x|\theta) g(x) \, dx + \int g(x)^2 \, dx \right]
               = \arg\min_\theta \left[ \int f(x|\theta)^2 \, dx - 2 E[f(X|\theta)] \right],

since ∫ g(x)^2 dx is a constant in the second line with respect to choice of θ, and
where X is a random variable from the true density, g(x). Now for many models
(including the normal mixture model), the first integral can be computed in closed
form for any value of the parameter vector, θ. Given a random sample, an unbiased
estimate of E[f(X|θ)] is the sample mean of the values f(x_i|θ). Thus a completely
data-based version of the ISE criterion is given by


    \hat\theta = \arg\min_\theta \left[ \int f(x|\theta)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i|\theta) \right].        (2.1)


In practice, this nonlinear optimization problem falls in the well-studied class of
M-estimators (Hjort, 1994; Basu et al., 1998; Scott, 2001), and θ̂ is often
asymptotically normal. Scott (2001) calls the value of θ which minimizes (2.1) the L2E
estimator, since integrated squared error and the L2 distance are equivalent.

A careful examination of the argument leading up to (2.1) reveals that the
density model, f(x|θ), need not be a density function, whereas the fact that g(x) is
a density was critical in estimating E[f(X|θ)]. Thus we may consider an incomplete
mixture model for f(x|θ). The simplest such model is the multivariate partial
density component (MPDC) given by

    f(x|\theta) = w \, \phi(x | \mu, \Sigma),        (2.2)

where θ = (w, μ, Σ). For the MPDC (2.1) is given explicitly by

    \hat\theta = \arg\min_\theta \left[ \frac{w^2}{2^p \pi^{p/2} |\Sigma|^{1/2}} - \frac{2w}{n} \sum_{i=1}^{n} \phi(x_i | \mu, \Sigma) \right].        (2.3)

S-Plus code that finds θ̂ is available from the author.
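
For intuition, the criterion can be written out in the univariate case (p = 1).
The sketch below is an illustration only and is not the author's code: the
function names, the simulated sample and the seed are assumptions. It also uses
the observation that, for fixed μ and σ, the criterion is a quadratic in w, so
the minimizing weight is available in closed form.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def l2e_criterion(w, mu, sigma, data):
    """Criterion for the partial component w * N(mu, sigma^2), with p = 1."""
    # Integral of (w * phi)^2 is w^2 / (2 * sigma * sqrt(pi)).
    integral = w * w / (2.0 * sigma * math.sqrt(math.pi))
    mean_density = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
    return integral - 2.0 * w * mean_density

def best_weight(mu, sigma, data):
    """Minimizer over w for fixed (mu, sigma): the criterion is a*w^2 - 2*b*w."""
    mean_density = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
    return mean_density * 2.0 * sigma * math.sqrt(math.pi)

# Hypothetical sample: 500 N(0, 1) points plus 100 outliers from N(5, 1).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
data += [random.gauss(5.0, 1.0) for _ in range(100)]

n = len(data)
ml_mean = sum(data) / n
ml_sd = math.sqrt(sum((x - ml_mean) ** 2 for x in data) / n)

w_hat = best_weight(0.0, 1.0, data)
crit_partial = l2e_criterion(w_hat, 0.0, 1.0, data)
crit_ml = l2e_criterion(1.0, ml_mean, ml_sd, data)
```

In this simulated sample the partial component centered on the main cluster
attains a lower criterion value than the full-weight maximum likelihood fit, and
the closed-form weight should fall near the true fraction 500/600.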



3. Bivariate Examples 

A thorough simulation study of robust properties often requires a book; see the
Princeton robustness study (Andrews et al., 1972), for example. Here we show a
sample of bivariate examples of normal mixtures where the number of components,
K, ranges from two to five. We try to avoid any hidden assumptions, such as
symmetry of outliers. There are no singleton outliers. In fact, most outliers are in
clusters with at least 50 or 100 points. The main central cluster has 500 or 1000
points.

In the first five and eighth examples, the “true” mixture component is located
at the origin, μ = (0,0)^T, with Σ = I_2, the identity matrix. In the bivariate case,
the parameter estimated is six-dimensional:

    θ = (w, μ_x, μ_y, σ_xx, σ_xy, σ_yy)^T.

The iteration was started with the maximum likelihood estimates of the parameters

    θ_0 = (1, x̄, ȳ, s_x^2, r s_x s_y, s_y^2)^T.

Both initial and final estimates of θ are displayed graphically in each figure as
one-sigma ellipses.

Here are the particulars of the mixture samples for the four examples
displayed in Figure 1:

(i) K = 2, n_1 = 500, n_2 = 100, μ_2 = (5,5)^T, Σ_2 = I_2;

(ii) K = 3, n_1 = 500, n_2 = n_3 = 100, μ_2 = (5,5)^T, μ_3 = -μ_2, Σ_2 = Σ_3 = I_2;

(iii) K = 2, n_1 = 500, n_2 = 100, μ_2 = (0,0)^T, Σ_2 = 25 I_2;

(iv) K = 2, n_1 = 500, n_2 = 100, μ_2 = (7,7)^T, Σ_2 = 9 I_2.
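
A configuration like example (i) can be simulated to see why the maximum
likelihood starting value is a poor fit under the L2E criterion. The sketch below
is an assumption-laden illustration: it evaluates the bivariate MPDC criterion at
two parameter vectors but does not perform the numerical optimization, and the
sample, seed and function names are invented for the example.

```python
import math
import random

def mpdc_criterion(theta, data):
    """Bivariate MPDC criterion for theta = (w, mx, my, sxx, sxy, syy)."""
    w, mx, my, sxx, sxy, syy = theta
    det = sxx * syy - sxy * sxy              # |Sigma|, assumed positive
    root = math.sqrt(det)
    # Integral of (w * phi)^2 over the plane: w^2 / (4 * pi * |Sigma|^(1/2)).
    integral = w * w / (4.0 * math.pi * root)
    total = 0.0
    for x, y in data:
        dx, dy = x - mx, y - my
        # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for a 2 x 2 matrix.
        q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        total += math.exp(-0.5 * q) / (2.0 * math.pi * root)
    return integral - 2.0 * w * total / len(data)

def ml_start(data):
    """Starting value (1, xbar, ybar, s_x^2, r s_x s_y, s_y^2)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    return (1.0, mx, my, sxx, sxy, syy)

# Hypothetical sample resembling example (i): 500 + 100 points.
random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
data += [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(100)]

theta_ml = ml_start(data)
theta_main = (500 / 600, 0.0, 0.0, 1.0, 0.0, 1.0)
crit_ml = mpdc_criterion(theta_ml, data)
crit_main = mpdc_criterion(theta_main, data)
```

The maximum likelihood covariance picks up a large positive off-diagonal term
from the outlying cluster, while the parameters of the main cluster alone give a
lower criterion value.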

In the first frame of Figure 1, 100 points are centered at (5,5)^T. The estimated
value w = 81.5%, which is very close to 500/600. Notice the lack of any visual
correlation in the final estimate of Σ_1.

In the second frame of Figure 1, two groups of 100 outlying points are centered
at (5,5)^T and (-5,-5)^T. The estimated value w = 68.3%, which is very close
to 500/700. Again, the apparent correlation in the initial maximum likelihood
estimate of Σ is correctly missing in the L2E estimate.

In the third frame of Figure 1, the 100 outliers share the same center but
their scale is 5 times larger. Given the overlap of “outliers” and “inliers,” w is
biased upwards. However, observe how the estimate of Σ_1 is much closer to I_2.

In the fourth frame of Figure 1, which is a hybrid of frames one and three, the
100 outlying points are centered at (7,7)^T and their scale is 9 I_2. The estimated
value w = 84.1%, which is very close to 500/600. Notice the lack of any correlation
in the final estimate of Σ_1.





Figure 1. Multivariate partial density component estimation examples.
The maximum likelihood one-sigma ellipse is shown as a
dotted line, while the L2E ellipse is shown as a thick solid line.
The estimate of w is also given. The origin is at the intersection of the two axes.
See text for true parameter values.



Our second set of examples is displayed in Figure 2. In two of these, the
true correlation coefficient of the main cluster is ρ = 0.7. Here are the particulars
of the mixture samples for the four examples displayed in Figure 2:

(v) K = 4, n_1 = 1000, n_2 = n_3 = n_4 = 100, μ_2 = (6,5)^T, μ_3 = -μ_2,
μ_4 = (5,-5)^T, Σ_2 = Σ_3 = Σ_4 = I_2;

(vi) K = 2, n_1 = 500, ρ_1 = 0.7, n_2 = 100, μ_2 = (5,5)^T, Σ_2 = I_2;

(vii) K = 2, n_1 = 500, ρ_1 = 0.7, n_2 = 100, μ_2 = (5,-5)^T, Σ_2 = I_2;

(viii) Multiple starting values with K = 5, n_1 = 1000, n_2 = 50, n_3 = 100,
n_4 = 150, n_5 = 200, μ_2 = (5,5)^T, μ_3 = (-5,-5)^T, μ_4 = (5,-5)^T,
μ_5 = (-5,5)^T, Σ_2 = Σ_3 = Σ_4 = Σ_5 = I_2.

In the first frame of Figure 2, three outlying clusters of 100 points each
surround 1000 points at the origin. The estimated value w = 74.9%, which is very
close to 1000/1300.









Figure 2. Second set of multivariate partial density component
estimate examples. In the final frame, solutions for five different
initial guesses of θ are displayed. See text for true θ values.

In the second frame of Figure 2, the 500 center points have a correlation of
ρ = 0.7. The group of 100 outlying points is centered at (5,5)^T, which is in line
with the main axis of the covariance matrix. Here, the apparent correlation in the
initial maximum likelihood estimate of Σ is correctly retained in the L2E estimate,
and the center is correctly estimated, too.

In the third frame of Figure 2, the main cluster is identical, but now the
100 outliers are centered at (5,-5)^T. Thus the initial covariance estimate shows a
strong, but negative, correlation, when the true correlation is strong, but positive.
The L2E estimates properly reorient the covariance matrix.
The fourth frame of Figure 2 is similar to the first, but with four outlying
clusters (of different sizes: 50, 100, 150, and 200). Starting from the maximum
likelihood estimates, an L2E MPDC component of size w = 0.66 is found (actual
value is 1000/1500).

Of special interest are the four other L2E MPDC estimates displayed. These
were found using different starting values for θ that were closer to the true
parameters of each outlying cluster. This confirms the earlier suggestion that the “size”
of a component is not as important as how separated the component is from the
remaining data. The smallest cluster found (in one of five runs) has w = 0.034,
which is very close to 50/1500.

We have run a number of limited experiments in higher dimensions. The
algorithm initially failed to converge on the Fisher Iris data. Closer examination
revealed the reason to be that the data were recorded to only one decimal place.
Adding a small amount of random noise (blurring) fixed the convergence problem.

Examples similar to those in Figure 1 were run in dimensions up to seven, with
full covariance estimation. In seven dimensions, that amounts to 1 + 7 + 28 = 36
quantities in the parameter vector, θ. An unfortunate aspect of the choice of
integrated squared error is that ISE is not dimensionless (as is Hellinger distance,
L1 distance, maximum likelihood). The practical implication of this fact is that
numerical optimization does not behave well due to scaling issues in these
dimensions. Gaining a better understanding of this phenomenon and extending the range
of dimensions where L2E may be applied should be a valuable area of research.



4. Regression 

The extension of L2E to regression

    y_i = x_i^T β + ε_i

is described and illustrated in Scott (2001). The algorithm is driven by a normal
assumption on the residuals, ε ~ N(0, σ_ε^2). In our experience, we often see clusters
of outliers in the regression plot, or even multiple regression curves mixed together.
Here we briefly describe the extension of the MPDC approach to regression. The
beauty of the regression setting is that the error distribution is univariate even
with p predictor variables. Thus the MPDC is given by this assumption for the
error variable:

    ε ~ w · N(0, σ_ε^2).

The parameter vector estimated is θ = (w, β, σ_ε)^T, which is of length p + 3,
assuming the parameter vector β includes an intercept term, β_0.
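
Because the error model is univariate, the regression criterion only involves the
residuals. The following sketch, for a single predictor (p = 1), is an
illustration under stated assumptions rather than the procedure of Scott (2001):
the simulated data, the seed and all function names are hypothetical, and the
criterion is only evaluated at two candidate fits, not optimized.

```python
import math
import random

def normal_pdf(r, sigma):
    return math.exp(-0.5 * (r / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def l2e_regression_criterion(w, b0, b1, sigma, xs, ys):
    """L2E criterion with the partial error model w * N(0, sigma^2) on residuals."""
    n = len(xs)
    mean_density = sum(normal_pdf(y - b0 - b1 * x, sigma)
                       for x, y in zip(xs, ys)) / n
    return w * w / (2.0 * sigma * math.sqrt(math.pi)) - 2.0 * w * mean_density

def least_squares(xs, ys):
    """Ordinary least squares for one predictor; returns (b0, b1, residual sd)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, math.sqrt(rss / (n - 2))

# Hypothetical data: 80 points on y = 1 + 2x + noise, 20 points shifted up by 15.
random.seed(1)
xs = [random.uniform(0.0, 10.0) for _ in range(100)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 1.0) + (15.0 if i >= 80 else 0.0)
      for i, x in enumerate(xs)]

b0_ls, b1_ls, sigma_ls = least_squares(xs, ys)
crit_ls = l2e_regression_criterion(1.0, b0_ls, b1_ls, sigma_ls, xs, ys)
crit_main = l2e_regression_criterion(0.8, 1.0, 2.0, 1.0, xs, ys)
```

Here θ has p + 3 = 4 entries (w, β_0, β_1, σ_ε). On this simulated sample the
criterion prefers the line of the main group with w = 0.8 to the least-squares
fit with w = 1, whose residual scale is inflated by the shifted points.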

We illustrate an application to a well-known set of data first described by
Harrison and Rubinfeld (1978) on the median house value in census tracts in
Boston, following the usual transformation of the p = 13 variables. The residuals
from the least-squares fit were smoothed using a kernel estimate, as shown in the
first frame of Figure 3. Also shown is a N(0, σ^2) curve, with residual variance
estimated in the usual manner. The two residual densities are not too close, but
only a few outliers are apparent.

We next fit the L2E model with noise distribution w · N(0, σ^2). The kernel
smooth of the residuals and the estimated MPDC are both drawn in the right
frame of Figure 3. Here, w = 84.5%. Interpreting and diagnosing the fit is much
easier with the L2E fit. For example, while there are some outliers on the low end,
almost 14% of the outliers appear on the high side. In a more complete analysis
where the outlying census tracts are interactively displayed on maps, we can learn
that these tracts fall in certain regions (some along the Charles River) and not at
random; see Scott and Christian (2003).



[Panel titles: LS residuals (left); robust residuals with wt=0.845 (right)]





Figure 3. Kernel estimate of estimated residuals and fitted nor- 
mal error density for the least squares fit (left frame) and the L2E 
fit (right frame), for the Boston housing data set. 



5. Discussion 

The PMDC approach is useful for outlier detection in many situations, and for 
clustering in particular; see Scott and Szewczyk (2004). The concept of break- 
down is more complicated in this setting, as the algorithm is local. That is, the 
algorithm converges to local normal clusters for certain ranges of initial parame- 
ter settings. We have carefully examined the attractiveness of a single univariate 
component as a function of the parameters of an adjacent component. Generally, 
if the components overlap to a significant amount, the algorithm may not converge 
to either of the separate components, but rather to a single large component. 

We have found a way to overcome this limitation by optimizing over a subset 
of the MPDC parameters. In Scott and Szewczyk (2004), numerous random initial 
guesses are used and the collection of parameter estimates, $\hat{\theta}$, is clustered to find
the most common solutions. We take these to be cluster locations. 

Models other than normal may be chosen; however, the closed form expression
of the L2E criterion is very convenient for optimization. If the main cluster is not
approximately normal, the L2E solution may find the cluster, but the estimated
parameters will vary depending upon the degree of non-normality.

There are numerous other extensions which we only touch upon here. For 
example, the MPDC may be formulated with more than one mixture component, 
or sequentially with some fixed mixture components, but then getting good initial
estimates for $\theta$ becomes more difficult. Other regression applications may be
formulated, including image processing tasks. We describe these separately.



Robust Fitting Using Mean Shift:
Applications in Computer Vision 

D. Suter and H. Wang 



Abstract. Much of computer vision and image analysis involves the extraction 
of “meaningful” information from images using concepts akin to regression 
and model fitting. Applications include: robot vision, automated surveillance 
(civil and military) and inspection, biomedical image analysis, video coding, 
human-machine interface, visualization, historical film restoration etc. 

However, problems in computer vision often have characteristics that 
are distinct from those usually addressed by the statistical community. These 
include pseudo-outliers: in a given image, there are usually several populations
of data. Some parts may correspond to one object in a scene and other parts 
will correspond to other, rather unrelated, objects. When attempting to fit 
a model to this data, one must consider all populations as outliers to other 
populations - the term pseudo-outlier has been coined for this situation. Thus 
it will rarely happen that a given population achieves the critical size of 50% 
of the total population and, therefore, techniques that have been touted for 
their high breakdown point (e.g., Least Median of Squares) are no longer 
reliable candidates, being limited to a 50% breakdown point. 

Computer vision researchers have developed their own techniques that 
perform in a robust fashion. These include RANSAC, ALKS, RESC and 
MUSE. In this paper new robust procedures are introduced and applied to 
two problems in computer vision: range image fitting and segmentation, and 
image motion estimation. The performance is shown, empirically, to be supe- 
rior to existing techniques and effective even when as little as 5-10% of the 
data actually belongs to any one structure. 

Mathematics Subject Classification (2000). Primary 68T45; Secondary 62F35. 
Keywords. Robust Statistics, model fitting, computer vision, breakdown-point.



1. Introduction

Computer Vision involves the extraction of “meaningful” information from images. 
Subareas include: robot vision, automated surveillance (civil and military) and 
inspection. Moreover, the techniques involved in computer vision also find their 
way into a diverse range of other areas, such as biomedical image analysis, video 
coding, human-machine interface, visualization, historical film restoration etc. 

Many of the tasks to be carried out, for the purposes of computer vision, can
be cast as forms of statistical estimation and fitting. This has led to a strong
community interest in statistical methods - particularly robust statistical methods.
Indeed, robust statistical methods in computer vision have a history, at least
two decades old, of using or adapting standard methods from the statistical
community: for example, M-Estimators (Black and Rangarajan, 1996) and Least
Median of Squares (Meer et al., 1991; Bab-Hadiashar and Suter, 1998). At least one
novel technique, RANSAC (Fischler and Bolles, 1981), was developed by vision
researchers in the early days (and is still a widely used technique).

However, problems in computer vision often have characteristics that are 
distinct from those addressed by the statistical community. These include: 

• Pseudo-outliers. In a given image, there are usually several populations of 
data. Some parts may correspond to one object in a scene and other parts 
will correspond to other, rather unrelated, objects. When attempting to fit 
a model to this data, one must consider all populations as outliers to other 
populations - the term pseudo-outlier has been coined (Stewart, 1995). Thus 
it will rarely happen that a given population achieves the critical size of 50% 
of the total population and, therefore, techniques that have been touted for 
their high breakdown point (e.g., Least Median of Squares) are no longer
reliable candidates from this point of view. 

• Unknown sizes of populations - and unknown location. Computer vision requires
fully automated analysis in, generally, rather unstructured environments.
Thus, the sizes and locations of the populations involved will fluctuate
greatly. Moreover, there is no “human in the loop” to select regions of
the image dominated by a single population, or to adjust various thresholds.
In contrast, statistical problems studied in most other areas usually have a 
single dominant population plus some percentage of outliers (typically mis- 
recordings - not the pseudo-outliers mentioned above). Typically a human 
expert is there to assess the results (and, if necessary, crop the data, adjust 
thresholds, try another technique etc.). 

• Large data sizes. Modern digital cameras exist with around 4 million pixels
per image. Image sequences, typically at up to 50 frames per second, contain
many images. Thus, computer vision researchers typically work with data
sets in the tens of thousands of elements, at least, and data sets in the $10^5$
and $10^6$ range are not uncommon.

• Emphasis on fast calculation. Most tasks in computer vision must be performed
“on-the-fly”. Offline analysis that takes seconds, let alone minutes or
hours, is usually a luxury afforded by relatively few applications.

These rather peculiar circumstances have led computer vision researchers to develop
their own techniques that perform in a robust fashion (perhaps “empirically
robust” should be used, as few have formally proved robustness properties, though many
trace their heritage to techniques that do have such proved properties). These include
ALKS (Lee et al., 1998), RESC (Yu et al., 1994) and MUSE (Miller and
Stewart, 1996). Results presented in those papers suggest that ALKS can be tolerant
to at least 65% outliers, RESC to about 80% outliers, and MUSE to 55%
outliers. However, it has to be admitted that a complete solution, addressing all
of the above problems, is far from being achieved. Indeed, none of the techniques,
with present hardware limitations, are really “real-time” when applied to the most
demanding tasks. None have been proved to reliably tolerate high percentages of
outliers and, indeed, we have found in our experiments that RESC and ALKS,
although clearly better than Least Median of Squares in this respect, are not
always reliable. For recent surveys, see Meer et al. (2000) and Stewart (1999).

The purpose of this paper is to highlight recent research carried out by the 
authors, in developing techniques with robust behavior, to address various prob- 
lems in computer vision - specifically, range image fitting and segmentation, and 
image motion estimation. 



2. Basic Idea and Results 

Because of the presence of multiple structures in the image, we need approaches
that are robust to (pseudo-)outliers in the sense of having a high breakdown point.
Established techniques, in use, that meet this criterion are based upon random
sampling: e.g., Least Median of Squares, Least Trimmed Squares, and RANSAC.
Random sampling techniques aim to explore the search space of possible solutions
well enough to have at least one candidate which is determined solely by inliers
(to a single structure in the data). However, since one does not have an “oracle” to
tell us which of the candidates are unpolluted by outliers, we require some form of
model/fit scoring. In Least Median of Squares, this is the median of the squared
residuals. In RANSAC, it is the number of data residuals inside a certain bound.
Of course, each form of model scoring has potential weaknesses. In Least Median
of Squares, the median of the residuals of the entire data set (with respect to the
candidate model) will obviously not be a good measure if the inliers to that model
make up less than 50% of the total data. Generalizing to the kth order statistic
(rather than the median) is one way out of this dilemma, but now one either has
to know in advance what value of k to use, or one has to attempt some form of
adaptation, e.g., ALKS (which will perhaps be costly and limited in reliability).
Even so, it is overly optimistic to expect a single statistic (the kth order residual,
for Least Median of Squares and ALKS; or the number of inliers within a certain
bound, for RANSAC) to be an entirely reliable/informative measure of the quality
of a solution.
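The two single-statistic scores discussed above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the synthetic residuals (30% inliers with standard deviation 0.1, 70% uniformly spread outliers) and the RANSAC bound of 0.3 are invented for the example.

```python
import random
import statistics


def lmeds_score(residuals):
    """Least Median of Squares score: median squared residual (lower is better)."""
    return statistics.median([r * r for r in residuals])


def ransac_score(residuals, bound):
    """RANSAC-style score: number of residuals within the bound (higher is better)."""
    return sum(1 for r in residuals if abs(r) <= bound)


# Hypothetical residuals with respect to the correct model of one structure:
# 300 inliers with small noise and 700 pseudo-outliers from other structures.
rng = random.Random(1)
residuals = [rng.gauss(0.0, 0.1) for _ in range(300)]
residuals += [rng.uniform(-10.0, 10.0) for _ in range(700)]

median_score = lmeds_score(residuals)
inlier_count = ransac_score(residuals, bound=0.3)
```

Even though the candidate model is correct here, the median squared residual is governed by the outliers (it is far above the inlier variance of 0.01), whereas the inlier count is informative only because a suitable bound was supplied in advance.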

This observation has led the authors to seek alternative ways of scoring
candidate models, so that greater robustness may be achieved. An early attempt
(Wang and Suter, 2002) employed possible symmetry in the data set as one such
statistic: though somewhat limited in versatility, such an approach definitely restores
robustness in situations where standard Least Median of Squares will break
down. Our more recent estimators seek to use more information from the residual
distribution. In particular, we have used Kernel Density estimation and Mean Shift
Techniques (Fukunaga and Hostetler, 1975) to formulate model/fit scores that lead
to empirically observed higher breakdown points than all existing methods.



2.1. Mean Shift and Kernel Density Estimation in Model Scoring 



Consider a candidate model fit (as part of a random selection and trial fitting as 
in the usual Least Median of Squares algorithm). We seek to approximate the pdf
of the residuals and find the mode of this distribution. 

Although other kernels could be employed, we use the Epanechnikov kernel 
in its 1-D form: 



$$K(x) = \begin{cases} \tfrac{3}{4}(1 - x^2) & \text{if } |x| < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$



The kernel density estimator is then:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) \qquad (2.2)$$

for a data set of n residuals $X_i$ and using “bandwidth” h.
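A minimal sketch of the kernel and the density estimator, written directly from the definitions above. The function names and the example residuals are assumptions introduced for illustration.

```python
def epanechnikov(u):
    """1-D Epanechnikov kernel, as in equation (2.1)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde(x, residuals, h):
    """Kernel density estimate at x with bandwidth h, as in equation (2.2)."""
    n = len(residuals)
    return sum(epanechnikov((x - xi) / h) for xi in residuals) / (n * h)


# Hypothetical residuals: a tight group near zero plus two distant outliers.
residuals = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.0, 7.5]
density_at_zero = kde(0.0, residuals, h=0.5)
density_at_two = kde(2.0, residuals, h=0.5)
```

Because the kernel has finite support, a residual further than h from the evaluation point contributes nothing, so the estimate at 2.0 is exactly zero in this example.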

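The mean shift is calculated by (Fukunaga and Hostetler, 1975):

$$M_h(x) = \frac{1}{n_x} \sum_{X_i \in S_h(x)} (X_i - x) \qquad (2.3)$$

where $S_h(x)$ is an interval of half-width h centered on x - the “mean shift window” -
and $n_x$ is the number of residuals falling inside that window.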

Simple calculations show that iterating the above will converge to a local 
maximum of the estimated pdf. Because of the limited extent of the mean shift 
window (in the particular case of the kernel we choose, there is an inherent limit to 
the spatial influence of a data point, coming from the finite support of the Kernel
- although a window could be imposed on other kernels not having finite support), 
this process is reasonably insensitive to outlier residuals. Comaniciu and co-workers 
(Comaniciu et al., 2002) have recently popularized this mean-shift technique for 
various applications in computer vision (e.g., clustering). Here we use the method 
as a way of scoring candidate fits. 

The basic notion is that the robust estimate should produce a strong peak in
the pdf of the residuals for that fit, and that the value of the residual corresponding
to that peak should be small (ideally zero, of course). We have experimented with
several ways to encapsulate such notions and have found maximization of the
following objective function performs well:

$$\mathrm{MDPE} = \frac{\sum_{X_i \in W_c} \hat{f}(X_i)}{\exp(|X_c|)} \qquad (2.4)$$



where the mean shift procedure is used to find the mode of the residual density
$\hat{f}$ and to limit mean estimation to using only “inliers” $X_i$ (with center value $X_c$)
within the mean-shift window $W_c$ centered on the mode of the pdf. This method
is similar to the Residual Consensus (RESC) method (Yu et al., 1994). Essentially
that method estimates the pdf by using a histogram (whose bin size is chosen by
compressing a large histogram of residuals using a heuristic procedure so that 12%
of the residuals are found in the first bin). The RESC criterion is then:



$$\mathrm{RESC} = \sum_{i=1}^{m} \frac{h_i^{\alpha}}{|v_i|^{\beta}} \qquad (2.5)$$



where $h_i$ is the histogram value and $v_i$ the residual of the ith bin. The value m is
chosen by another heuristic (4.4% of $h_{max}$), so as to exclude outliers. (Note: in Yu
et al., 1994, $\alpha$ = 1.3 and $\beta$ = 1.) In essence, the scores given in equations 2.4 and 2.5
differ in how one restricts attention to likely inliers (we use the mean shift window)
and how one models the pdf (we use kernel density estimation and mean shift mode
seeking). These differences lead to significant improvements in robustness.
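Putting the pieces together, a score of the form (2.4) might be computed as sketched below. This is an illustrative reading of the criterion, not the authors' implementation: starting the mean shift at zero residual, the helper names and the synthetic residuals are all assumptions.

```python
import math


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde(x, residuals, h):
    return sum(epanechnikov((x - xi) / h) for xi in residuals) / (len(residuals) * h)


def mean_shift_mode(residuals, start, h, tol=1e-6, max_iter=200):
    x = start
    for _ in range(max_iter):
        window = [xi for xi in residuals if abs(xi - x) <= h]
        if not window:
            break
        shift = sum(window) / len(window) - x
        x += shift
        if abs(shift) < tol:
            break
    return x


def mdpe_score(residuals, h, start=0.0):
    """Sum of densities over the mean-shift window, penalized by exp(|X_c|)."""
    x_c = mean_shift_mode(residuals, start, h)            # window centre X_c
    window = [r for r in residuals if abs(r - x_c) <= h]  # residuals in W_c
    return sum(kde(r, residuals, h) for r in window) / math.exp(abs(x_c))


# Hypothetical residuals: 20 inliers near zero and 80 evenly spread outliers.
inliers = [0.01 * i - 0.095 for i in range(20)]
outliers = [-40.0 + 80.0 * k / 79 for k in range(80)]
good_fit = inliers + outliers              # residuals of a correct candidate
bad_fit = [r + 3.0 for r in good_fit]      # candidate offset from the structure

good_score = mdpe_score(good_fit, h=0.5)
bad_score = mdpe_score(bad_fit, h=0.5)
```

The correct candidate scores far higher even though only 20% of the residuals are inliers, since the score depends on the density peak near zero rather than on the median of all residuals.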



2.2. Examples 

In this section we show that approaches based upon these procedures can tolerate 
up to 90% or so of outliers (including pseudo-outliers) and outperform previous 
computer vision methods (MUSE, RESC, ALKS, RANSAC) and more widely 
known methods (Least Median of Squares and Least Trimmed Squares) in that 
regard. We demonstrate this with experiments using synthetic data and with “real 
life” data in the areas of: line and circle finding, range image segmentation and 
image motion estimation. 



2.2.1. Lines and Circle Finding. Various edge detection routines can be employed
to try to find edges in a scene. Typically, these routines produce very noisy output
(gaps in edges, isolated points etc. - see Figures 1 and 2). If one is seeking
structures that have particular parametric forms (e.g., lines and circles) one can
use regression type frameworks to find the structures of interest amongst the edge
detector output.

Thus this group of problems is similar to the standard regression problem,
although we acknowledge that one should consider using geometric distance, rather
than residuals, in the minimization (Kanatani, 1996). Here, however, we avoid the
non-linear theory and resultant approximations of geometric fitting, and apply
our procedure to the residuals produced by substituting the data points into the
defining parametric form of a circle or line. Note: as opposed to standard settings
for regression, we have multiple structures. To find every line we could find any
line first, then find the inliers to that line, remove them from the pool of data and
re-apply the procedure to find the next line. Here we just demonstrate the first
stage - finding any valid structure (though in segmentation, Section 3, we do find
all of the structures sequentially).

Figure 1. Line finding.

In computer vision, probably the most commonly used technique, for such 
problems, is the Hough Transform. Essentially, this transform discretizes the pa- 
rameter space and each data point “votes” for the parameters it is consistent with. 
Therefore, it produces an estimate of the parameters with a finite precision which 
is determined by the “bin” size in parameter space. In this respect, it is unlike the 
other methods we consider: all of the other methods produce estimates with po- 
tentially infinite precision (limited by round-off errors, and, of course, in accuracy 
rather than precision). Moreover, since the parameter space has to be discretized, 
the Hough transform is simple to apply when the parameter space is compact or 
known limits apply (e.g., as imposed by the finite boundaries of an image) but is 
less simple when the parameter space has infinite extent or the practical bounds 
on likely encountered parameters are less clear. For these reasons, despite the early 
and continuing successes of such a technique, we prefer to seek improved methods. 
Nonetheless, it is instructive to compare the results with the Hough transform. 
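The voting scheme can be sketched for lines in the normal form ρ = x cos θ + y sin θ. The parametrization, the bin sizes, the function name and the sample points are assumptions chosen for illustration; practical implementations differ in many details.

```python
import math


def hough_lines(points, n_theta=180, rho_max=20.0, rho_step=0.1):
    """Vote in a discretized (theta, rho) space; return the winning cell and its votes."""
    n_rho = int(round(2 * rho_max / rho_step)) + 1
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int(round((rho + rho_max) / rho_step))
            if 0 <= r < n_rho:
                votes[(t, r)] = votes.get((t, r), 0) + 1
    (t, r), count = max(votes.items(), key=lambda kv: kv[1])
    # The estimate is the centre of the winning bin, so its precision is
    # limited by the angular step (pi / n_theta) and by rho_step.
    return math.pi * t / n_theta, r * rho_step - rho_max, count


# Hypothetical edge points: 21 points on the line y = 3 plus three stray points.
points = [(float(x), 3.0) for x in range(-10, 11)]
points += [(-7.0, -4.0), (5.0, 9.0), (0.0, -8.0)]
theta, rho, count = hough_lines(points)
```

Note that `rho_max` has to be fixed before voting starts, which is straightforward for a bounded image but less so when the parameter range is not known in advance.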

In addition to the Hough transform, we experiment with an established
robust fitting method, Least Median of Squares, and several techniques not generally
known outside of the computer vision community: RANSAC, ALKS, and
RESC. An example can be found in Figure 1. The pavement image 1(a) is processed
with an edge detector 1(b). After various forms of robust line fitting are
applied, we obtain the results typically as shown in 1(c). Least Median of Squares
(light upper line) fails because it is not robust to so many outliers. ALKS (middle
dark line) also fails. The remaining techniques (MDPE, RESC and Hough -
lower dark line) all produce similar results on this example - as the discretization
of the parameter space for the Hough transform is adequate, its limit of precision
is not apparent.

Figure 2. Circle finding.

Similarly, examples of circle fitting can be found in Figure 2. The cups image
2(a) is processed with an edge detector 2(b). After various forms of robust circle
fitting are applied, we obtain the results shown in 2(c). MDPE (finds bottom cup),
Hough (finds cup above it) and RANSAC (finds next cup above) all produce robust
fits. Least Median of Squares, ALKS and RESC all break down (larger circles in
the figure). As a second example, a synthetic data set was created to resemble
“Olympic rings” and then artificially corrupted with uniformly distributed noise
samples - see Figure 2(d). In this example, the data that are “inliers” to any one
ring constitute only about 5% of the data - that is, for a given fit, there are around
95% outliers to that fit. MDPE (bottom right ring), RANSAC (top right ring) and
the Hough transform (top middle ring) can all find a ring. However, Least Median
of Squares, ALKS and RESC again break down. Note: it is not significant which
structure is found.

These and other similar experiments we have performed show that algorithms
based upon our MDPE criterion outperform other robust techniques (Least
Median of Squares, techniques based upon least kth order statistics such as ALKS,
and the RESC approach). The technique is challenged for robustness only by
RANSAC (which requires a priori knowledge of the expected sizes of the majority
of the inlier residuals) and the non-regression (limited precision) Hough transform
(which also requires a well-chosen bin size for the parameter discretization). We
must acknowledge that almost no method has an implementation that is free of
certain parameters that have to be set. In our experiments, where there are such
parameters, we have tried first to use the settings advocated by the authors of
those methods (if, indeed, they are available to us). We have also tried to vary
those parameters and to select the best performance delivered by a method over
the range of parameters investigated. We believe that this is as fair as we can
be in our comparisons. The essential parameter required for our technique is the
bandwidth of the kernel density estimator (although there are the inevitable minor
bounds and tolerance parameters that plague code in order to guard against
certain numerical limits, or to decide when to cease iteration). In the experiments
reported in this section, we empirically investigated the behavior of our approach
with varying bandwidth and found that overall performance, at least over a
reasonable range of bandwidth choices, was reasonably stable. In the next section we
present results where we have automatically chosen the bandwidth.



2.2.2. Optic Flow. Optic flow is the apparent motion on the image plane
caused by relative motion between an observer and the imaged points. If one
assumes that the imaged point at position (x, y) and time t maintains a constant
brightness I(x, y, t), as it moves under the optic flow (u, v), then a simple differentiation
reveals to first order:

$$\frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} = 0 \qquad (2.6)$$

- the so-called “Optic Flow Constraint” (see, e.g., Bab-Hadiashar and Suter, 1998).
Since this constraint provides only one constraint in two unknowns, it has been 
solved by assuming, within a small image patch, that the flow is of a simple 
parametric form (e.g., locally constant, or locally affine). Such low order local 
polynomial approximations can be justified in the usual manner - providing the 
flow is locally smooth. In fact, it can be shown that if the scene contains rigid 
planar patches, then to a good approximation, the flow will be an 8 parameter 
quadratic 



$$u = ax^2 + bxy + cx + dy + e; \qquad v = by^2 + axy + fx + gy + h \qquad (2.7)$$

(i.e., having fewer degrees of freedom than the full quadratic defined over a 2-D
domain.)
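To make the link to residual-based scoring concrete, the sketch below evaluates the optic flow constraint (2.6) under the 8-parameter model (2.7): each pixel in a patch contributes one residual, and a candidate parameter vector is judged by the distribution of those residuals. The function names and the synthetic derivative values are assumptions made for this example.

```python
def flow_model(params, x, y):
    """Flow (u, v) at position (x, y) under the 8-parameter quadratic model (2.7)."""
    a, b, c, d, e, f, g, h = params
    u = a * x * x + b * x * y + c * x + d * y + e
    v = b * y * y + a * x * y + f * x + g * y + h
    return u, v


def constraint_residual(params, x, y, ix, iy, it):
    """Residual of the optic flow constraint (2.6) for one pixel.

    ix, iy and it stand for the measured brightness derivatives with
    respect to x, y and t at that pixel.
    """
    u, v = flow_model(params, x, y)
    return ix * u + iy * v + it


# Hypothetical patch whose true flow is constant: u = 1, v = -2.
true_params = (0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, -2.0)
pixels = [(0.0, 0.0, 0.5, 0.2), (1.0, 0.0, -0.3, 0.8), (0.0, 1.0, 0.9, -0.4)]
measurements = [(x, y, ix, iy, -(ix * 1.0 + iy * -2.0)) for x, y, ix, iy in pixels]

residuals_true = [constraint_residual(true_params, *m) for m in measurements]
wrong_params = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
residuals_wrong = [constraint_residual(wrong_params, *m) for m in measurements]
```

With noise-free derivatives the true parameters give zero residuals; in practice the residuals from pixels that follow a different flow act as pseudo-outliers.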

Within each patch, we can measure the quantities $\partial I/\partial x$, $\partial I/\partial y$ and $\partial I/\partial t$ so that
we can solve for the parameters of the assumed flow model. However, since the
measurements can contain very large errors, and since the patches may straddle
different flows (moving according to different parameters), we need methods of
solution that are robust to large numbers of outliers (including pseudo-outliers).
Robust statistical methods such as Least Median of Squares have been demonstrated
as superior to Least Squares and other competing methods in this respect
(Bab-Hadiashar and Suter, 1998).

Table 1. Yosemite Valley Sequence - Optic Flow.

Technique                                    Average Error
Bab-Hadiashar and Suter (1998)               1.94
QMDPE ($\alpha$ = 2.0, 25 x 25, m = 30)      1.34



We have applied our method, as described above but with some modifications,
to the problem of optic flow estimation (Wang and Suter, 2003a). Firstly, we have
implemented a form of automatic selection of the kernel bandwidth. Using standard
techniques (Wand and Jones, 1995):

$$h = \left[ \frac{243\, R(K)}{35\, u_2(K)^2\, n} \right]^{1/5} s \qquad (2.8)$$

where $R(K) = \int_{-1}^{1} K(\zeta)^2 \, d\zeta$, $u_2(K) = \int_{-1}^{1} \zeta^2 K(\zeta) \, d\zeta$, and s is the sample standard
deviation. This still requires the estimation of the sample scale in a robust fashion
and, since the above is the recommended upper bound on h, we also have to employ
a multiplicative adjustment ch for 0 < c < 1. Space does not permit us to go into
details here, but we have investigated, with some success, both the use of order
statistics for scale estimation and a rather more novel scheme of using a mean-shift
like procedure to also find the pdf valley (thus providing a useful classification of
inliers as those between the peak and the valley in the pdf). We have also, for
speed reasons, modified the MDPE criterion to use what we call QMDPE:



$$\mathrm{QMDPE} = \frac{\left(\hat{f}(X_c)\right)^{\alpha}}{\exp(|X_c|)} \qquad (2.9)$$
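The bandwidth rule (2.8) and the score (2.9) can be sketched together as follows. For the Epanechnikov kernel the two constants evaluate to R(K) = 3/5 and u_2(K) = 1/5. The scale estimate of 0.2, the mean shift start at zero, the helper names and the synthetic residuals are assumptions for illustration; the robust scale estimators mentioned above are not reproduced here.

```python
import math

R_K = 3.0 / 5.0    # integral of K(z)^2 over [-1, 1] for the Epanechnikov kernel
U2_K = 1.0 / 5.0   # integral of z^2 K(z) over [-1, 1] for the same kernel


def bandwidth(n, scale, c=1.0):
    """Upper-bound bandwidth of equation (2.8), multiplied by the adjustment c."""
    return c * (243.0 * R_K / (35.0 * U2_K ** 2 * n)) ** 0.2 * scale


def kde(x, residuals, h):
    total = 0.0
    for xi in residuals:
        u = (x - xi) / h
        if abs(u) < 1.0:
            total += 0.75 * (1.0 - u * u)
    return total / (len(residuals) * h)


def mean_shift_mode(residuals, start, h, tol=1e-6, max_iter=200):
    x = start
    for _ in range(max_iter):
        window = [xi for xi in residuals if abs(xi - x) <= h]
        if not window:
            break
        shift = sum(window) / len(window) - x
        x += shift
        if abs(shift) < tol:
            break
    return x


def qmdpe_score(residuals, h, alpha=2.0, start=0.0):
    """Density at the mode, raised to alpha, penalized by exp(|X_c|)."""
    x_c = mean_shift_mode(residuals, start, h)
    return kde(x_c, residuals, h) ** alpha / math.exp(abs(x_c))


# Hypothetical residuals: 20 inliers near zero and 80 evenly spread outliers.
inliers = [0.01 * i - 0.095 for i in range(20)]
outliers = [-40.0 + 80.0 * k / 79 for k in range(80)]
good_fit = inliers + outliers
bad_fit = [r + 3.0 for r in good_fit]

h = bandwidth(len(good_fit), scale=0.2)   # scale assumed to come from a robust estimator
good_score = qmdpe_score(good_fit, h)
bad_score = qmdpe_score(bad_fit, h)
```

Only one density evaluation is needed per candidate, rather than one per residual in the window, which is presumably where the speed advantage over the sum in (2.4) comes from.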



One sample of our results is provided in Table 1. To estimate the optic flow
at a point, we center a rectangular patch on that point and solve using all optic
flow constraints from that patch. The table reports an error measure (see Bab-Hadiashar
and Suter, 1998, for a definition) averaged over every result. We compare
against the closest (and to this point best) competing robust method, and to other
methods available in the literature. Clearly QMDPE performs well.



3. Segmentation 

Segmentation is the process of dividing the image (or a sequence of images such as
a movie) into spatial and/or temporal structures of interest. In principle, one can
use a parametric fitting method to sequentially find structures in the data: fit a
model, find the inliers to that fitted model, remove the inliers and repeat by fitting
another model. Such approaches, using variants of least kth order fitting, have been
shown to be reasonably promising (Bab-Hadiashar and Suter, 1999; Bab-Hadiashar and Suter,
2000). However, the implementation of such a relatively simple scheme becomes
complicated by a number of practical issues that are beyond the scope of this paper
and possibly of little interest to the intended audience - suffice to say that one can
devise methods, based upon robust fitting schemes, that outperform traditional
and other competing methods.
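The fit, remove and repeat loop can be sketched on a toy problem in which each "structure" is a constant level in 1-D data. The stand-in fitting step below simply picks the data value with the most neighbours within a tolerance; it is not MDPE or any method from the cited papers, and the function names, tolerance, stopping rule and data are assumptions for illustration.

```python
def best_level(data, tol):
    """Stand-in robust fit: the data value with the most neighbours within tol."""
    return max(data, key=lambda c: sum(1 for x in data if abs(x - c) <= tol))


def sequential_structures(data, tol, min_inliers):
    """Repeatedly fit a level, record its inliers, remove them and refit."""
    remaining = list(data)
    structures = []
    while remaining:
        level = best_level(remaining, tol)
        inliers = [x for x in remaining if abs(x - level) <= tol]
        if len(inliers) < min_inliers:
            break  # what is left is treated as unstructured outliers
        structures.append((sum(inliers) / len(inliers), len(inliers)))
        remaining = [x for x in remaining if abs(x - level) > tol]
    return structures


# Hypothetical data: three levels (near 0, 5 and 10) and two stray points.
data = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9,
        10.0, 10.1, 9.9, 10.05, 10.02, 2.5, 7.3]
structures = sequential_structures(data, tol=0.3, min_inliers=3)
```

Even in this toy setting the loop needs an inlier tolerance and a stopping rule, which hints at the practical issues mentioned above.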

Consider for example, a range image, such as one collected via a laser range- 
finder, or stereo camera, or structured lighting technique: ultimately, though these 
devices differ in their principles, they all produce an “image” which is a set of 3D 
points sampled from some real world scene. The challenge is then to make sense 
of this mass of data points - to extract meaningful structures and surfaces. We 
have shown that one can segment a range image, into smooth parametric surfaces 
(Wang and Suter, 2003b). 

Likewise, using variants of Least kth Squares, one can segment a movie sequence
purely on the basis of the motion (Bab-Hadiashar et al., 2002). We are
working on using MDPE and QMDPE on this task. It is worthwhile noting that
to completely solve the problem of segmentation, in situations where the observed
objects can have varying shapes and motions, one must not only use a robust fitting
technique (and solve the attendant implementation issues), but one also needs
to solve the model selection problem. Though the cited papers show some promising
investigations into that issue, robust model selection in this context remains
the major (and daunting) problem.



4. Conclusion 

In this paper we have adopted the philosophy that a single statistic, such as the
kth order statistic, or the number of inliers within a certain bound, is unlikely
to be a sophisticated enough measure to reliably discriminate between candidate
model fits in a robust procedure that has to cope with a wide range of possible
outlier/pseudo-outlier populations. Instead, we have looked at characterizing the
quality of a model fit by more complex measures of the residual distribution:
capturing information such as how peaked around zero the residual probability
density function is. To this end, we have devised procedures that, at their core, use
kernel density estimation of the pdf and a mean-shift approach to locate the peak
of that pdf. Experiments have shown that such procedures can considerably outperform
existing robust techniques in terms of apparent breakdown point behavior.

In concluding, we must remark on the shortcomings of the approaches we
are hereby promoting. From a practical point of view, the methods are somewhat
costly. Though we haven’t extensively studied the issue of just how computationally
efficient the approaches can become, it would be optimistic to expect that they
can compete with methods relying on much simpler measures of candidate
solution quality. From a theoretical point of view, a lot remains to be studied.
Firstly, the reader would be justified in concluding that some aspects of our proposal
are ad hoc and that many variants can be easily dreamt up - without this paper
providing any justification for the particular variants we have examined. Secondly,
though we promote our schemes in terms of “breakdown point”, we acknowledge
a number of issues in respect of this. We have not formally defined “breakdown
point”; nor, consequently, have we in any way attempted to prove attainment of a
high breakdown point. In these respects, our approach is intuitive and empirical.
However, we trust, despite these shortcomings, the techniques we have described 
will be of use to the computer vision community (and wider) as the basis of proven 
practical methods which can be refined, and whose theoretical underpinnings can 
be explored. Moreover, we must point out that, despite impressions that may be 
obtained by reading much of the literature, particularly that aimed more at the 
practitioner, more traditionally accepted techniques still have their shortcomings 
in similar ways. For example, though it is often cited that Least Median of Squares 
has a proven breakdown point of 50%, it is often overlooked that all practical imple- 
mentations of Least Median of Squares are an approximate form of Least Median 
of Squares (and thus only have a weaker guarantee of robustness). Indeed, the 
robustness of practical versions of Least Median of Squares hinges on the robust- 
ness of two components (and in two different ways): the robustness of the median 
residual as a measure of quality of fit and the robustness of the random sampling 
procedure to find at least one residual distribution whose median is not greatly 
affected by outliers. Our procedures, like many other procedures, share the second 
vulnerability as we too rely on random sampling techniques. The first vulnerability 
is sometimes disregarded for practical versions of Least Median of Squares, because 
robustness is viewed as being guaranteed by virtue of the proof of robustness for 
the ideal Least Median of Squares. However, two comments should be made in 
this respect. Firstly, that proof relies on assumptions regarding the outlier distri- 
bution and it can easily be shown that clustered outliers will invalidate that proof. 
Secondly, there is an inherent “gap” between a proof for an ideal procedure and 
what one can say about an approximation to that procedure. We believe that our 
method of scoring the fits better protects against the vulnerabilities that structure 
in the outliers expose. We have presented empirical evidence to support that. The 
challenge is to not only continue to amass empirical evidence, but to also explore 
the theoretical properties of these (and similar) schemes.



Testing the Equality of Location Parameters
for Skewed Distributions Using $S_1$ with
High Breakdown Robust Scale Estimators

S.S. Syed Yahaya, A.R. Othman and H.J. Keselman 



Abstract. A simulation study was carried out to compare the Type I
error and power of $S_1$, a statistic recommended by Babu et al. (1999) for
testing the equality of location parameters for skewed distributions. Othman
et al. (in press) showed that this statistic is robust to the underlying populations
and is also powerful. In our work, we modified this statistic by replacing
the standard errors of the sample medians with four alternative robust scale
estimators: the median absolute deviation (MAD) and three of the scale estimators
proposed by Rousseeuw and Croux (1993), namely Qn, Sn, and Tn. These
estimators were chosen based on their high breakdown value and bounded influence
function, and in addition, they are simple and easy to compute. Even
though MAD is more appropriate for symmetric distributions (Rousseeuw
and Croux, 1993), due to its popularity and for the purpose of comparison,
we decided to include it in our study. The comparison of these methods was
based on their Type I error and the power for J = 4 groups in an unbalanced
design having heterogeneous variances. Data from the Chi-square distribution
with 3 degrees of freedom were considered. Since the null distribution of $S_1$
is intractable, and its asymptotic null distribution may not be of much use
for practical sample sizes, bootstrap methods were used to give a better approximation.
The $S_1$ statistic combined with each of the scale estimators was
shown to have good control of Type I errors.

Mathematics Subject Classification (2000). 62G10. 

Keywords. Type I error, power, bootstrap, skewed distributions, breakdown 
value. 



1. Introduction 

Progress has been made in finding better methods for controlling Type I 
error and power to detect treatment effects in one-way independent groups 
designs. Through a combination of impressive theoretical developments, more 
flexible statistical methods, and faster computers, serious practical problems 
that seemed insurmountable only a few years ago can now be addressed. These 
developments are important to applied researchers because they greatly enhance 
the ability to discover true differences between groups and improve their 
efforts to guard against seeing benefits that do not exist. 

Analysis of variance (ANOVA) is one of the most commonly used statistical 
methods for locating treatment effects in the one-way independent groups de- 
sign. Generally, violating the assumptions associated with the standard ANOVA 
method can seriously hamper its ability to detect true differences. Non-normality 
and heteroscedasticity are the two general problems in ANOVA. In particular, 
when these problems occur at the same time, rates of Type I error usually are 
inflated or depressed, resulting in spurious rejections of null hypotheses. They can 
also substantially reduce the power of a test, resulting in treatment effects going 
undetected. Reductions in the power to detect differences between groups occur 
because the usual standard deviation ($\sigma$) is very sensitive to outliers 
and will be greatly influenced by their presence. Consequently, the standard 
error of the mean ($\sigma/\sqrt{n}$) can become seriously inflated when the 
underlying distribution has heavy tails (Wilcox et al., 1998). Therefore, the 
standard error of the F statistic in ANOVA is larger than it should be and 
power accordingly will be depressed. 

To achieve a good test, one needs to be able to control Type I errors and 
to increase the power. We do not want to lose power, and at the same time we 
do not want to inflate the Type I error. In recent years, numerous methods for 
locating treatment effects while simultaneously controlling Type I error and 
power have been studied. The classical least squares estimators can be highly 
inefficient in non-normal models. In their effort to control the Type I error 
rate and power, investigators have looked into numerous robust methods. 
Robust methods generally are insensitive to assumptions about the overall nature 
of the data. Robust measures of location such as trimmed means, medians or 
M-estimators were considered as alternatives to the usual least squares 
estimator, that is, the usual mean. These measures of central tendency have been 
shown to have better control over Type I error and power to detect treatment 
effects (Othman, Keselman, Padmanabhan, Wilcox, & Fradette, in press). Using trimmed 
means and variances based on Winsorized sum of squares will enable one to obtain 
test statistics which do not suffer losses in power due to non-normality (Wilcox, 
Keselman, & Kowalchuk, 1998). 

Babu, Padmanabhan, and Puri (1999) proposed a more flexible statistical 
method that can deal with asymmetric distributions and heteroscedastic settings. 
Known as the $S_1$ statistic, this method is one of the latest procedures for 
assessing the effects of a treatment variable across groups. Othman et al. (in 
press) replaced the standard errors of the sample medians in $S_1$ with 
asymptotic variances, but this modification did not result in better Type I 
error control than the original statistic. 

Unlike methods using trimmed means, when using $S_1$ one can work with 
the original data without having to transform or trim the data to achieve 
symmetry. Simple transformations may fail to deal effectively with outliers and 
heavy-tailed distributions. Even the popular strategy of taking logarithms of all 
the observations does not necessarily reduce problems due to outliers (Wilcox, 
1997). 



2. Methods 



This paper focuses on the $S_1$ method and the modified $S_1$ methods. The 
modified $S_1$ methods are the $S_1$ statistic combined with each of the four 
scale estimators proposed by Rousseeuw and Croux (1993). These estimators, MADn, 
Sn, Qn, and Tn, were chosen for their high breakdown value and bounded influence 
function, the basic tools for judging robustness (Wilcox, 1997). The $S_1$ 
methods were compared in terms of Type I error and power under conditions of 
normality and non-normality. Non-normality is represented by skewed 
distributions since $S_1$ is more appropriate for skewed data. The $S_1$ methods 
use sample medians as the measure of central tendency. Being simple and having 
the highest possible breakdown value, the sample median is still a popular 
robust estimator of location. 



2.1. $S_1$ Method 



When dealing with a skewed distribution, the parameter of interest is the 
population median. For this particular case, the $S_1$ statistic, which is based 
on sample medians, is more appropriate for comparing distributions (Babu et al., 
1999). $S_1$ is a solution to the problem when the assumption of symmetry is 
suspect. To understand $S_1$, consider the problem of comparing location 
parameters for skewed distributions. Let $Y_j = (Y_{1j}, Y_{2j}, \ldots, Y_{n_j j})$ 
be a sample from an unknown skewed distribution $F_j$ and let $M_j$ be the 
population median of $F_j$, $j = 1, 2, \ldots, J$. For testing 
$H_0: M_1 = M_2 = \cdots = M_J$ versus $H_1: M_i \neq M_j$ for at least one pair 
$(i, j)$, the $S_1$ statistic is defined as 

$$S_1 = \sum_{1 \le i < j \le J} |s_{ij}|,$$ 

where 

$$s_{ij} = \frac{\hat{M}_i - \hat{M}_j}{\sqrt{\hat{\omega}_i + \hat{\omega}_j}},$$ 

$\hat{M}_j$ is the sample median from the jth group, 

$\hat{\omega}_j$ is the squared mean absolute deviation from the sample median, $MD_j$, and 
$n_j$ is the sample size for group j. 

$S_1$ is the sum, over all possible pairs of groups, of the absolute differences 
of sample medians from the J distributions divided by their respective sample 
standard errors, which are based on $\hat{\omega}_j$. Therefore, if there are J 
distributions, then the number of possible differences equals $J(J - 1)/2$. Since 
the sampling distribution of $S_1$ is unknown, Babu et al. (1999), followed by 
Othman et al. (in press), used the bootstrap percentile method for obtaining 
p-values. According to Babu et al. (1999), the bootstrap method is known to give 
a better approximation than the one based on normal approximation theory, and 
this method is attractive, especially when the samples are of moderate size. 
Taking into consideration the intractability of the sampling distribution of 
$S_1$ and the reliability of the bootstrap method, the p-values in our study were 
obtained by using the percentile bootstrap method (see, e.g., Efron and 
Tibshirani, 1993). To obtain the p-value, the percentile bootstrap method is used 
as follows: 

1. Calculate $S_1$ based on the available data. 

2. Generate bootstrap samples by randomly sampling with replacement $n_j$ 
observations from the jth group, yielding $Y^*_{1j}, Y^*_{2j}, \ldots, Y^*_{n_j j}$. 

3. Each of the sample points in the bootstrapped groups must be centered at 
their respective estimated medians. 

4. Use the bootstrap sample to compute the $S_1$ statistic, denoted by $S_1^*$. 

5. Repeat Step 2 to Step 4 B times, yielding $S^*_{11}, S^*_{12}, \ldots, S^*_{1B}$. 
B = 599 appears sufficient in most situations when n > 12 (Wilcox, 1997). 

6. Calculate the p-value as (number of $S^*_{1b} > S_1$)/B. 
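
The six steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' SAS implementation. The function names `s1_statistic` and `bootstrap_p_value` are hypothetical, and the form $\hat{\omega}_j = MD_j^2 / n_j$ used for the squared standard error is an assumption made here for illustration. Centering is done by subtracting each group's original sample median from its resampled values.

```python
import numpy as np


def s1_statistic(groups):
    """Sum over all pairs of |median_i - median_j| / sqrt(omega_i + omega_j)."""
    medians = [np.median(g) for g in groups]
    # Assumption: omega_j = MD_j**2 / n_j, where MD_j is the mean absolute
    # deviation of group j from its sample median.
    omegas = [np.mean(np.abs(g - m)) ** 2 / len(g)
              for g, m in zip(groups, medians)]
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            total += abs(medians[i] - medians[j]) / np.sqrt(omegas[i] + omegas[j])
    return total


def bootstrap_p_value(groups, n_boot=599, seed=0):
    """Percentile bootstrap p-value for S1, following Steps 1 to 6."""
    rng = np.random.default_rng(seed)
    groups = [np.asarray(g, dtype=float) for g in groups]
    s1_obs = s1_statistic(groups)                      # Step 1
    exceed = 0
    for _ in range(n_boot):                            # Step 5
        boot = []
        for g in groups:
            resampled = rng.choice(g, size=len(g), replace=True)  # Step 2
            boot.append(resampled - np.median(g))                 # Step 3
        if s1_statistic(boot) > s1_obs:                # Step 4
            exceed += 1
    return s1_obs, exceed / n_boot                     # Step 6
```

Swapping in another scale estimator only requires changing the line that computes `omegas`.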

The amount of computer time depends mainly on how long it takes to evaluate 
the bootstrap replications and increases linearly with B. The number B varies 
according to approximations. For estimating the standard error, B = 50 is often 
enough to give a good estimate, while larger B is needed for estimating the per- 
centiles. Efron and Tibshirani (1993) suggested that B should be at least 500 or 
1000 in order to make the variability of the estimated percentile acceptably low. 
Hypothesis testing will adopt the same range of B as the percentile to achieve 
acceptable accuracy. 

Type I error and power of the test corresponding to each method will be 
determined and compared. 

2.2. Scale Estimators 

When searching for measures of scale, the breakdown value turns out to have 
considerable practical importance (Wilcox, 1997). The four scale estimators pro- 
posed by Rousseeuw and Croux (1993) have the optimum breakdown value of 0.5. 
These scale estimators have explicit formulas, which guarantee uniqueness of the 
estimates. They also have bounded influence functions, which is one of the most 
important properties for robust estimators. Another advantage of using these 
estimators is their simplicity, which makes them easy to compute. 

Let $X = (x_1, x_2, \ldots, x_n)$ be a random sample from any distribution and let 
the sample median be denoted by $\mathrm{med}_i\, x_i$. 

2.2.1. MADn. A very robust scale estimator is the median absolute deviation 
about the median, given by 

$$MAD_n = b\, \mathrm{med}_i\, |x_i - \mathrm{med}_j\, x_j|.$$ 

The constant b is needed to make the estimator consistent for the parameter of 
interest. 

MADn has the best possible breakdown value, and its influence function is 
bounded, with the sharpest possible bound among all scale estimators 
(Rousseeuw and Croux, 1993). Huber (1981) identified MADn as the single most 
useful ancillary estimate of scale due to its high breakdown property. 

Despite all these advantages, MADn has some drawbacks. It has very low 
efficiency (only 37%) with Gaussian distributions. MADn takes a symmetric view 
of dispersion, because one first estimates a central value (the median) and then 
attaches equal importance to positive and negative deviations from it, which does 
not seem to be a natural approach for asymmetric distributions. 

2.2.2. Sn. Rousseeuw and Croux (1993) suggested alternatives to MADn that can 
be used as initial or ancillary scale estimates in the same way but that are more 
efficient and not slanted towards symmetric distributions. 

One such estimator is Sn, defined as 

$$S_n = c\, \mathrm{med}_i\, \{\mathrm{med}_j\, |x_i - x_j|\}.$$ 

Sn is very similar to MADn, the only difference being that the $\mathrm{med}_j$ 
operation is moved outside the absolute value. This makes Sn a location-free 
estimator. Instead of measuring the deviation of observations from a central 
value, Sn looks at a typical distance between observations. Another advantage is 
its explicit formula, which means that this estimator is always uniquely defined. 
A modest simulation study by Rousseeuw and Croux found that the correction 
factor c = 1.1926 succeeded in making Sn unbiased for finite samples. They also 
proved that Sn has the highest possible breakdown value. In terms of efficiency, 
Sn was proven to be more efficient (58.23%) than MADn. 

2.2.3. Qn. Even though the influence functions for MADn and Sn are bounded, 
they have discontinuities. For a smooth influence function, Rousseeuw and Croux 
proposed an estimator Qn defined as 

$$Q_n = d\, \{|x_i - x_j|;\ i < j\}_{(k)},$$ 

where d is a constant factor, 

$$k = \binom{h}{2} \approx \binom{n}{2} / 4,$$ 

and $h = [n/2] + 1$. 

The estimator Qn shares the attractive properties of Sn: a simple and explicit 
formula, suitability for asymmetric distributions, and the optimal breakdown 
value (50%). Other added advantages are the smooth influence function and the 
high efficiency (82%) with Gaussian distributions. However, with small samples, 
Sn performs better than Qn. 

2.2.4. Tn. Another promising scale estimator proposed by Rousseeuw and Croux 
(1993), which possesses the attractive properties of a robust scale estimator, is 
Tn, defined as 

$$T_n = 1.3800\, \frac{1}{h} \sum_{k=1}^{h} \{\mathrm{med}_{j \neq i}\, |x_i - x_j|\}_{(k)}.$$ 

It was proven that Tn has a 50% breakdown value, a continuous influence 
function, and an efficiency of 52%, which makes it more efficient than MADn. 
Like Sn and Qn, this estimator has a simple and explicit formula which guarantees 
uniqueness, and it is suitable for asymmetric distributions. 
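
The four estimators can be illustrated with a naive O(n^2) Python sketch. It is meant only to make the definitions concrete and is not the efficient algorithm of Rousseeuw and Croux. Several simplifying assumptions are made: ordinary medians replace the low and high medians of the original definitions, no finite-sample correction factors are applied, and the constants b = 1.4826 and d = 2.2219 are the usual Gaussian consistency factors, which are not stated in this paper. The function names are hypothetical.

```python
import numpy as np
from math import comb


def mad_n(x, b=1.4826):
    """Median absolute deviation about the median, scaled by b."""
    x = np.asarray(x, dtype=float)
    return b * np.median(np.abs(x - np.median(x)))


def s_n(x, c=1.1926):
    """Outer median over i of the inner median over j of |x_i - x_j|."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    return c * np.median(np.median(dist, axis=1))


def q_n(x, d=2.2219):
    """k-th order statistic of the pairwise distances |x_i - x_j|, i < j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = n // 2 + 1
    k = comb(h, 2)
    i, j = np.triu_indices(n, k=1)
    diffs = np.sort(np.abs(x[i] - x[j]))
    return d * diffs[k - 1]


def t_n(x, const=1.3800):
    """Average of the h smallest values of med_{j != i} |x_i - x_j|."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = n // 2 + 1
    dist = np.abs(x[:, None] - x[None, :])
    # Drop the diagonal so that the inner median runs over j != i.
    off_diag = dist[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    inner = np.sort(np.median(off_diag, axis=1))
    return const * inner[:h].mean()
```

For a large standard normal sample all four functions should return values close to 1, and replacing a minority of the observations by a gross outlier should change the estimates only moderately.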

Taking into consideration all the attractive properties attached to the scale 
estimators, such as the breakdown value, continuous influence function, and their 
efficiency, we substituted the standard errors derived from them in place of 
$\hat{\omega}$ in $S_1$. 



3. Procedures 

The procedures investigated were: 

1. $S_1$ with MADn 

2. $S_1$ with Qn 

3. $S_1$ with Sn 

4. $S_1$ with Tn 

5. $S_1$ with $\hat{\omega}$ 

Each of these five methods was tested for treatment group equality under two types 
of distributions, the normal and skewed distributions. Note that for the rest of this 
paper, each of these methods will be referred to by its scale estimator: MADn, Qn, 
Sn, Tn, and $\hat{\omega}$. We compared MADn, Qn, Sn, and Tn with the existing 
procedure, $\hat{\omega}$, in terms of their Type I error and power rates. 



4. Empirical Investigation 

For comparison with the work done by Othman et al. (in press), this paper focused 
on an unbalanced completely randomized design containing four groups with small 
samples. Since $S_1$ is appropriate for skewed distributions, we chose the 
$\chi^2_3$ distribution for simulating the non-normality condition. The skewness 
and kurtosis values for the $\chi^2_3$ distribution are 1.63 and 4.00 
respectively. This distributional shape was chosen for comparability with that 
work. Type I error rates have been found to be distorted when the underlying 
distribution is skewed, e.g., in the case of the two-sample t-test in Sawilowsky 
and Blair (1992). Other conditions which are known to highlight the strengths and 
weaknesses of tests for equality of location are heteroscedasticity and the 
pairing of variances and group sizes. 

For this reason, only unbalanced designs with unequal variances in a 36:1 ratio 
were considered (see Table 1). Variances and group sizes were both positively 
and negatively paired. For positive pairings, the group having the fewest 
observations was associated with the population having the smallest variance, 
while the group having the greatest number of observations was associated with 
the population having the largest variance. For negative pairings, the group 
with the most observations was paired with the smallest variance and the group 
with the fewest observations was paired with the population having the largest 
variance. These conditions were chosen since they typically produce conservative 
results for the positive pairings and liberal results for the negative pairings 
(Othman et al., in press). We set the sample sizes at $n_1 = 10$, $n_2 = 15$, 
$n_3 = 25$ and $n_4 = 30$, with heterogeneous variances of 1, 1, 1, and 36 
respectively for positive pairings and 36, 1, 1, 1 respectively for the negative 
pairings. 

Table 1. Design Specification for the Four Groups. 

PAIRING      GROUP SIZES             GROUP VAR 
             1     2     3     4     1     2     3     4 
POSITIVE     10    15    25    30    1     1     1     36 
NEGATIVE     10    15    25    30    36    1     1     1 

Table 2. Type I Error rates. 

Distribution  Pairing   $S_1$ with corresponding scale estimators 
                        MADn     Qn       Sn       Tn       $\hat{\omega}$ 
$\chi^2_3$    pos       0.027    0.017    0.029    0.032    0.004 
              neg       0.034    0.019    0.029    0.037    0.007 
Normal        pos       0.025    0.012    0.021    0.023    0.017 
              neg       0.029    0.023    0.028    0.030    0.013 
Average                 0.029    0.018    0.027    0.031    0.010 


Our choices of these extreme conditions (skewness, heteroscedasticity, and 
unbalanced designs) were based on the premise that if a procedure works under 
extreme conditions, it is likely to work under most conditions to be encountered 
by researchers. 

The random samples were drawn using the SAS generator RANNOR (SAS Institute, 
1989). The variates were standardized, and then transformed to $\chi^2_3$ 
variates having mean $\mu_j$ and variance $\sigma_j^2$. The design specification 
for the four groups is shown in Table 1. 
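
One way to generate such data is sketched below in Python rather than SAS. This is an assumption about the construction, not a transcription of the authors' program: a $\chi^2_3$ variate is formed as the sum of three squared standard normal deviates, standardized using its theoretical mean 3 and variance 6, and then shifted and rescaled. The function name `chi2_3_group` is hypothetical.

```python
import numpy as np


def chi2_3_group(n, mean, var, rng):
    """Draw n skewed observations with a chi-square(3) shape and given mean and variance."""
    z = rng.standard_normal((n, 3))
    y = (z ** 2).sum(axis=1)              # chi-square(3): mean 3, variance 6
    y_std = (y - 3.0) / np.sqrt(6.0)      # standardized: mean 0, variance 1
    return mean + np.sqrt(var) * y_std


# Positive pairing of Table 1: the largest group gets the largest variance.
rng = np.random.default_rng(2003)
sizes = (10, 15, 25, 30)
variances = (1, 1, 1, 36)
design = [chi2_3_group(n, 0.0, v, rng) for n, v in zip(sizes, variances)]
```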

For Type I error, the group means were (0, 0, 0, 0). For power, one of the 
group means will be non-zero. Cohen (1977) stated that for the effect size to be 
uniquely determined, the pattern separation of the means should be specified. 
Three patterns were identified: the minimum, intermediate, and maximum 
variability. Our study focused on the intermediate variability, where the J 
means were equally spaced over the range. In this case, the group means were 
(-1, -0.5, 0.5, 1). 

For each of the designs, 1000 datasets were simulated, and 599 bootstrap 
samples were generated. 



5. Results 

The results for Type I error and power rates for the methods investigated are 
outlined in Table 2 and Table 3, respectively. 
Table 3. Power rates. 

Distribution  Pairing   $S_1$ with corresponding scale estimators 
                        MADn     Qn       Sn       Tn       $\hat{\omega}$ 
$\chi^2_3$    pos       0.078    0.059    0.075    0.091    0.100 
              neg       0.088    0.064    0.082    0.102    0.131 
Normal        pos       0.403    0.365    0.396    0.408    0.588 
              neg       0.227    0.260    0.221    0.278    0.715 
Average                 0.199    0.187    0.194    0.220    0.384 


Based on the liberal criterion of robustness (Bradley, 1978), a test can be 
considered robust if its empirical rate of Type I error, $\hat{\alpha}$, is 
within the interval $0.5\alpha \le \hat{\alpha} \le 1.5\alpha$. For the nominal 
level $\alpha = 0.05$, the Type I error rate should be between 0.025 and 0.075. 
The empirical Type I error rates in Table 2 indicate robustness in three of the 
methods investigated. MADn, Sn and Tn produced average values ranging from 0.027 
to 0.031, all within Bradley's liberal criterion. These methods produced higher 
Type I error rates for the skewed distribution than for the normal distribution. 
Even though the method using the Qn estimator did not satisfy Bradley's liberal 
criterion, its average error rate across both distributions was higher than that 
of the default $S_1$ method (with $\hat{\omega}$). The average value for Qn 
was 0.018 while for $\hat{\omega}$ the average value was only 0.010. However, 
both methods produced average Type I error rates which were considered to be too 
conservative, meaning that the estimated rates of Type I error were below 0.025. 
$\hat{\omega}$ produced a more conservative Type I error rate for the skewed 
distribution. 
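
As a small worked check, Bradley's liberal criterion can be applied to the average Type I error rates reported in Table 2. The helper name `bradley_liberal` is hypothetical and the code is only a sketch of the arithmetic.

```python
def bradley_liberal(alpha_hat, alpha=0.05):
    """True when the empirical rate lies within [0.5 * alpha, 1.5 * alpha]."""
    return 0.5 * alpha <= alpha_hat <= 1.5 * alpha


# Average empirical Type I error rates taken from Table 2.
averages = {"MADn": 0.029, "Qn": 0.018, "Sn": 0.027, "Tn": 0.031, "omega": 0.010}
robust = {name: bradley_liberal(rate) for name, rate in averages.items()}
```

Here `robust` flags MADn, Sn and Tn as meeting the criterion, while Qn and the default method fall below the lower bound of 0.025.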

For both distributions, the empirical Type I error rates for the positive pair- 
ings were smaller than for negative pairings, except for the pairings for Sn which 
showed no variability when the data were skewed. 

Our new methods, combining the Babu et al. (1999) $S_1$ and the Rousseeuw and 
Croux (1993) scale estimators, were able to show some improvement over the 
default $S_1$ using $\hat{\omega}$ in terms of Type I error rate. The Tn method 
resulted in the best average error rate of 0.031, which was nearest to the 
nominal level. All the methods studied (excluding $\hat{\omega}$) produced a 
better average error rate for the skewed distribution than for the normal 
distribution. The average error rate across the three methods, MADn, Sn, and Tn, 
exhibited small variability for both distributions. 

The power values outlined in Table 3 show two sets of results. The 
low values were obtained when data were skewed, whereas the larger 
values were obtained when data were normal. Ranging from 0.075 to 0.131, the 
average power rates for the new methods under the skewed distribution were low. 
The average values when data were normal ranged from a low of 0.221 to a high 
of 0.408. The default $S_1$ under the normal distribution resulted in an average 
power rate of 0.7. Even though the default $S_1$ under skewed distributions 
produced the highest average power rate compared to the rest of the methods, the 
value of 0.131 was still very low. The mean values across the four methods 
showed very little variability for both distributions. 

The negative and positive pairings for the skewed distributions did not show 
much variability for each method, but for the normal distribution, the variability 
was obvious for different pairings. 



6. Conclusion 

This paper focused on the situation, common in psychological and educational 
research, where the observations are from skewed distributions. One requires 
statistics which are robust, especially in locating treatment effects. Realizing 
the need for a good statistic to address this problem, we integrated the $S_1$ 
statistic by Babu et al. (1999) with the high breakdown scale estimators of 
Rousseeuw and Croux (1993). This paper has shown some improvement in the 
statistical solution of locating treatment effects. In controlling the Type I 
error rate, the study reported in this paper leads us to formulate the following 
conclusions and recommendations. When symmetry is suspect, we can avoid trimming 
or transforming the observations by using one of the methods in our paper. These 
new methods produced better Type I error rates than the default $S_1$ using 
$\hat{\omega}$. Three of the investigated methods, MADn, Sn, and Tn, reasonably 
controlled Type I errors; the remaining methods, Qn and $\hat{\omega}$, were 
conservative at a significance level of 0.05. 

The methods are considered robust when they meet the criteria for robustness 
with values in between 0.025 and 0.075 for 0.05 level. 

The findings on power rate did not show any improvement over the previous 
research done by Othman et al. (in press). Babu et al. (1999), in their 
investigation of exponential and log-normal distributions with $S_1$, also 
produced low power rates, which were less than 0.10. 
Rank Scores Tests of 
Multivariate Independence 

S. Taskinen, A. Kankainen and H. Oja 



Abstract. New rank scores test statistics are proposed for testing whether two 
random vectors are independent. The tests are asymptotically distribution- 
free for elliptically symmetric marginal distributions. Recently, Gieser and 
Randles (1997), Taskinen et al. (2003a) and Taskinen et al. (2003b) introduced 
and discussed different multivariate extensions of the quadrant test, Kendall’s 
tau and Spearman’s rho statistics. In this paper, standardized multivariate 
spatial signs and the (univariate) ranks of the Mahalanobis-type distances 
of the observations from the origin are combined to construct rank scores 
tests of independence. The limiting distributions of the test statistics are 
derived under the null hypothesis as well as under contiguous sequences of 
alternatives. Three different choices of the score functions, namely the sign 
scores, the Wilcoxon scores and the van der Waerden scores, are discussed in 
greater detail. The small sample and limiting efficiencies of the test procedures 
are compared and the robustness properties are illustrated by an example. It 
is remarkable that, in the multinormal case, the limiting Pitman efficiency 
of the van der Waerden scores test equals that of the classical parametric 
Wilks’ test. 

Mathematics Subject Classification (2000). Primary 62G10; Secondary 62H15. 

Keywords. Affine equivariant signs, efficiency, Kendall’s tau, robustness, 
Spearman’s rho, Wilks’ test. 



1. Introduction 

Let $x_i = (x_i^{(1)}, x_i^{(2)})$ for $i = 1, \ldots, n$ be a random sample of 
vector pairs, where $x_i^{(1)}$ and $x_i^{(2)}$ are p- and q-dimensional 
continuous random vectors. We wish to test the null hypothesis 

$H_0$: $x_i^{(1)}$ and $x_i^{(2)}$ are independent. 



The research was partially supported by the Academy of Finland. 
The classical parametric test due to Wilks (1935) is based on the partitioned 
sample covariance matrix S and is defined as 

$$W_n = \frac{|S|}{|S_{11}|\,|S_{22}|}.$$ 


Puri and Sen (1971) introduced a nonparametric analogue to Wilks’ test where 
the data vectors are replaced by the vectors of their componentwise ranks. Gieser 
and Randles (1997) and Taskinen et al. (2003a) proposed invariant extensions of 
the univariate quadrant test of Blomqvist (1950). The former test procedure is 
based on interdirection counts and the latter on standardized spatial signs. If the 
marginal distributions of $x_i^{(1)}$ and $x_i^{(2)}$ are elliptic, these two 
tests are asymptotically equivalent. Later Taskinen et al. (2003b) proposed 
multivariate invariant extensions of Kendall’s tau and Spearman’s rho. 

Our plan is as follows. In Section 2, we explain the test constructions start- 
ing with standardized spatial signs and ranks of the lengths of the standardized 
vectors. The test statistics for multivariate dependence are then introduced. Spe- 
cial choices of the score functions then yield the sign test, the Wilcoxon scores 
test and the van der Waerden scores test. In Section 3, the limiting distribution 
of the test statistic is derived under the null hypothesis and under interesting se- 
quences of contiguous alternatives. The finite-sample and limiting efficiencies of 
the new procedures are then compared to that of the classical Wilks’ test in Sec- 
tion 4, and the robustness properties are illustrated by an example in the final 
Section 5. The proofs can be found in Taskinen et al. (2003c) on the web site 
http://www.maths.jyu.fi/~ojahannu. 



2. The Rank Scores Test Statistics 



2.1. Spatial Signs and Ranks of the Distances From the Origin 

Consider a random sample $x_1, \ldots, x_n$ from a k-variate distribution. The 
spatial sign of vector $x$ is defined as 

$$S(x) = \begin{cases} \|x\|^{-1} x, & x \neq 0, \\ 0, & x = 0, \end{cases}$$ 

where $\|x\| = (x^T x)^{1/2}$ is the (Euclidean) length of the vector $x$. The 
spatial signs $S(x_i)$ and ranks $\mathrm{rank}(\|x_i\|)$ of the distances from 
the origin are not invariant under affine transformations of the data vectors, 
however. In order to construct invariant test statistics, the data points have to 
be standardized before spatial signs and ranks are formed. For the 
standardization we need affine equivariant $\sqrt{n}$-consistent location vector 
and scatter matrix estimates, $\hat{\mu}$ and $\hat{C}$. The transformed data 
points are then given as $z_i = \hat{C}^{-1/2}(x_i - \hat{\mu})$, $i = 1, \ldots, n$. 

The vectors $u_i = S(z_i)$, $i = 1, \ldots, n$, are called standardized spatial 
sign vectors. Standardized sign vectors are affine invariant in the sense that if 
$u_i^*$ are calculated from $x_i^* = A x_i + b$, $i = 1, \ldots, n$, with a 
nonsingular $k \times k$ matrix $A$ and k-vector $b$, then $u_i^* = P u_i$, 
$i = 1, \ldots, n$, for some orthogonal $P$. See, e.g., Taskinen et al. (2003b). 
The ranks $R_i = \mathrm{rank}(\|z_i\|)$ are naturally affine invariant 
(in the usual sense). Note that, in the standardization, the scatter matrix 
estimate $\hat{C}$ may be replaced by a $\sqrt{n}$-consistent affine equivariant 
shape matrix estimate $\hat{V}$, as only the directions and ranks of distances 
are used in the analysis. For the shape matrices, see Ollila et al. (2003). Note 
also that, if the standardization is done using location vector and scatter (or 
shape) matrix estimates that do not require any moment assumptions on the 
underlying data (e.g., Tyler’s shape matrix and the transformation 
retransformation spatial median in Hettmansperger and Randles, 2002), then the 
resulting test procedures are valid without any moment assumptions. 
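
The standardization step can be sketched in Python as follows. This is only an illustration under a simplifying assumption: the sample mean and sample covariance matrix are used as the location and scatter estimates, which is a convenient affine equivariant choice but not a robust one such as Tyler's shape matrix. The function name is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def standardized_signs_and_ranks(x):
    """Return standardized spatial signs u_i and ranks R_i of ||z_i||.

    x is an (n, k) array. Assumption: mean and covariance serve as the
    location and scatter estimates, and no z_i is exactly zero.
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    # Symmetric inverse square root of the scatter matrix.
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    z = centered @ inv_sqrt
    r = np.linalg.norm(z, axis=1)
    return z / r[:, None], rankdata(r)
```

Under an affine transformation of the data the ranks stay the same and the sign vectors are only rotated, so the matrix of inner products between sign vectors is unchanged.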

2.2. New Test Statistics 

Our test statistic for testing the null hypothesis of independence is obtained as 
follows. For $i = 1, \ldots, n$, write $u_i^{(1)}$ for the p-dimensional 
standardized sign vectors based on the first components $x_i^{(1)}$, and let 
$R_i^{(1)}$ denote the rank of $\|z_i^{(1)}\|$ among 
$\|z_1^{(1)}\|, \ldots, \|z_n^{(1)}\|$. For the second random vector, write 
similarly $u_i^{(2)}$ for the q-dimensional standardized sign vectors based on 
$x_i^{(2)}$, and let $z_i^{(2)}$ and $R_i^{(2)}$ be constructed as before. The 
test statistic is then as follows. 



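Definition 2.1. Let $a : (0, 1) \to R$ and $b : (0, 1) \to R$ be continuous, 
monotone and square integrable score functions and write 

$$\hat{H} = \mathrm{ave}_i \left\{ a\!\left(\frac{R_i^{(1)}}{n+1}\right) b\!\left(\frac{R_i^{(2)}}{n+1}\right) u_i^{(1)} u_i^{(2)T} \right\}.$$ 

The rank test statistic for testing $H_0$ is then 

$$T_n = \frac{npq}{\sigma_a^2 \sigma_b^2}\, \|\hat{H}\|^2,$$ 

where $\|\hat{H}\|^2 = Tr(\hat{H}^T \hat{H})$, $\sigma_a^2 = E[a^2(U)]$ and 
$\sigma_b^2 = E[b^2(U)]$ with U uniformly distributed on (0, 1). 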



Note that since standardized sign vectors and ranks are invariant with respect 
to the group of affine transformations, the invariance of $T_n$ easily follows. 
As score functions, one may use optimal location score functions; see Hallin and 
Paindaveine (2002). In the following, some choices of the score functions 
and the resulting test statistics are given. 



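Definition 2.2. For $a(u) = 1$ and $b(u) = 1$, the sign test of independence 
(Taskinen et al., 2003a) with test statistic 

$$T_{0n} = npq\, \left\| \mathrm{ave}_i \{ u_i^{(1)} u_i^{(2)T} \} \right\|^2$$ 

is obtained. For $a(u) = u$ and $b(u) = u$ one gets the Wilcoxon (scores) test of 
independence with test statistic 

$$T_{1n} = \frac{9npq}{(n+1)^4}\, \left\| \mathrm{ave}_i \{ R_i^{(1)} R_i^{(2)} u_i^{(1)} u_i^{(2)T} \} \right\|^2.$$ 

Finally the choices $a(u) = [\Psi_p^{-1}(u)]^{1/2}$ and 
$b(u) = [\Psi_q^{-1}(u)]^{1/2}$, where $\Psi_k$ is the cdf of the chi-square 
distribution with k degrees of freedom, yield the van der Waerden (scores) test 
of independence with test statistic 

$$T_{2n} = n\, \left\| \mathrm{ave}_i \left\{ \left[\Psi_p^{-1}\!\left(\frac{R_i^{(1)}}{n+1}\right)\right]^{1/2} \left[\Psi_q^{-1}\!\left(\frac{R_i^{(2)}}{n+1}\right)\right]^{1/2} u_i^{(1)} u_i^{(2)T} \right\} \right\|^2.$$ 
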



3. Limiting Distributions 

In order to derive the limiting distribution of $T_n$, we assume that the 
marginal distributions of $x_i^{(1)}$ and $x_i^{(2)}$ are elliptically symmetric. 
The marginal density functions are then of the form 

$$f(x) = |\Sigma|^{-1/2} f_0(\Sigma^{-1/2}(x - \mu)),$$ 

where $\Sigma$ is a positive definite symmetric matrix and 
$f_0(z) = \exp\{-\rho(\|z\|)\}$ with $z = \Sigma^{-1/2}(x - \mu)$. Note that if 
$r = \|z\|$ and $u = z/r$, then r and u are independent. 

In the following we denote the cdf of $r_i^{(1)}$ as $G_1$ and the cdf of 
$r_i^{(2)}$ as $G_2$. 

To establish a limiting distribution of our test statistic under the null hy- 
pothesis, we need the following lemma. 



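Lemma 3.1. Let 

$$\tilde{H} = \mathrm{ave}_i \{ a(G_1(r_i^{(1)}))\, b(G_2(r_i^{(2)}))\, u_i^{(1)} u_i^{(2)T} \};$$ 

then $\sqrt{n}(\hat{H} - \tilde{H}) \to_P 0$, where $\hat{H}$ is the matrix of 
Definition 2.1. 

Now the limiting distribution can be found easily. 

Theorem 3.2. Under $H_0$ and for elliptically distributed $x_i^{(1)}$ and 
$x_i^{(2)}$, the limiting distribution of $T_n$ is a chi-square distribution with 
pq degrees of freedom. 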
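
The test statistics of Section 2 and the chi-square approximation of Theorem 3.2 can be combined into a short Python sketch. It is an illustration only. The function names are hypothetical, and the sample mean and covariance matrix are assumed as the location and scatter estimates, so the sketch does not have the robustness of the procedures discussed in the paper. The variance constants used are $\sigma_a^2 = \sigma_b^2 = 1$ for sign scores, 1/3 for Wilcoxon scores, and p and q for van der Waerden scores.

```python
import numpy as np
from scipy.stats import chi2, rankdata


def _signs_and_ranks(x):
    # Assumption: mean and covariance are the location and scatter estimates.
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    vals, vecs = np.linalg.eigh(cov)
    z = centered @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)
    r = np.linalg.norm(z, axis=1)
    return z / r[:, None], rankdata(r)


def rank_scores_test(x1, x2, scores="wilcoxon"):
    """Return (T_n, approximate p-value) for (n, p) and (n, q) arrays x1, x2."""
    u1, r1 = _signs_and_ranks(x1)
    u2, r2 = _signs_and_ranks(x2)
    n, p = u1.shape
    q = u2.shape[1]
    v1, v2 = r1 / (n + 1), r2 / (n + 1)
    if scores == "sign":
        a, b, var_a, var_b = np.ones(n), np.ones(n), 1.0, 1.0
    elif scores == "wilcoxon":
        a, b, var_a, var_b = v1, v2, 1.0 / 3.0, 1.0 / 3.0
    elif scores == "vdw":
        a, b = np.sqrt(chi2.ppf(v1, p)), np.sqrt(chi2.ppf(v2, q))
        var_a, var_b = float(p), float(q)
    else:
        raise ValueError("scores must be 'sign', 'wilcoxon' or 'vdw'")
    # H = ave_i { a_i b_i u_i^(1) u_i^(2)T }, a p x q matrix.
    H = (u1 * (a * b)[:, None]).T @ u2 / n
    T = n * p * q / (var_a * var_b) * np.sum(H ** 2)
    return T, chi2.sf(T, p * q)
```

A call such as `rank_scores_test(x1, x2, "vdw")` returns the statistic together with an approximate p-value from the chi-square distribution with pq degrees of freedom.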



Next we derive the limiting distribution of $T_n$ under alternative sequences 
similar to those used in Gieser and Randles (1997). As $T_n$ is affine invariant, 
we restrict to the spherical case only. See the Appendix for a discussion of the 
alternative sequences. Let thus $x_i^{(1)}$ and $x_i^{(2)}$ be independent with 
spherical marginal densities $\exp\{-\rho_1(\|x^{(1)}\|)\}$ and 
$\exp\{-\rho_2(\|x^{(2)}\|)\}$, respectively, and write 

$$\begin{pmatrix} y_i^{(1)} \\ y_i^{(2)} \end{pmatrix} = \begin{pmatrix} (1 - \Delta) I_p & \Delta M_1 \\ \Delta M_2 & (1 - \Delta) I_q \end{pmatrix} \begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \end{pmatrix}, \qquad (3.1)$$ 

where $\Delta = \delta/\sqrt{n}$. If $T_n^*$ is calculated from the transformed 
observations in (3.1), we get 

Theorem 3.3. Under general assumptions, the limiting distribution of $T_n^*$ is a 
noncentral chi-square distribution with pq degrees of freedom and noncentrality 
parameter 

$$\frac{\delta^2}{pq\, \sigma_a^2 \sigma_b^2}\, \|c_1 M_1 + c_2 M_2^T\|^2,$$ 

where 

$$c_1 = E[a(G_1(r_i^{(1)}))\, \psi_1(r_i^{(1)})]\; E[b(G_2(r_i^{(2)}))\, r_i^{(2)}]$$ 

and 

$$c_2 = E[b(G_2(r_i^{(2)}))\, \psi_2(r_i^{(2)})]\; E[a(G_1(r_i^{(1)}))\, r_i^{(1)}],$$ 

with optimal location score functions $\psi_1(r) = \rho_1'(r)$ and 
$\psi_2(r) = \rho_2'(r)$. 

4. Limiting and Finite-Sample Efficiencies 



4.1. Limiting Pitman Efficiencies 

In this section we consider the sign, Wilcoxon and van der Waerden tests of 
independence. We compare the limiting and finite-sample efficiencies of the new 
tests to those of Wilks’ likelihood ratio test $W_n$. The comparisons are made in 
the multivariate normal distribution, t distribution and contaminated normal 
distribution cases. Since $-n \log W_n$ has, under the alternative sequences, a 
limiting noncentral chi-squared distribution with pq degrees of freedom and 
noncentrality parameter $\delta^2 \|M_1 + M_2^T\|^2$, the asymptotic efficiencies 
are simply 

$$\mathrm{ARE}(T_n, W_n) = \frac{\|c_1 M_1 + c_2 M_2^T\|^2}{pq\, \sigma_a^2 \sigma_b^2\, \|M_1 + M_2^T\|^2},$$ 

where $c_1$ and $c_2$ are given in Theorem 3.3. Note that for the multivariate 
normal distribution, $\psi(r) = r$; for the k-variate t distribution with $\nu$ 
degrees of freedom, $\psi(r) = (k + \nu) r / (\nu + r^2)$; and for the k-variate 
contaminated normal distribution with cdf 
$F(x) = (1 - \epsilon)\Phi(x) + \epsilon \Phi(c^{-1} x)$, where $c > 0$ and $\Phi$ 
is the cdf of $N_k(0, I_k)$, 

$$\psi(r) = r\, \frac{(1 - \epsilon) \exp(-r^2/2) + \epsilon c^{-k-2} \exp(-r^2/2c^2)}{(1 - \epsilon) \exp(-r^2/2) + \epsilon c^{-k} \exp(-r^2/2c^2)}.$$ 


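Assume now for simplicity that $M_1 = M_2^T$. For the limiting efficiency of 
the sign test of independence, we refer to Taskinen et al. (2003a). The limiting 
efficiency of the Wilcoxon test $T_{1n}$ with respect to Wilks' test $W_n$ is 

\[ \mathrm{ARE}(T_{1n}, W_n) = \frac{9\,(c_1 + c_2)^2}{4pq}, \]

where 

\[ c_1 = E[G_1(r_i^{(1)})\,\psi_1(r_i^{(1)})]\; E[G_2(r_i^{(2)})\, r_i^{(2)}] \]

and 

\[ c_2 = E[G_2(r_i^{(2)})\,\psi_2(r_i^{(2)})]\; E[G_1(r_i^{(1)})\, r_i^{(1)}]. \]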

The resulting efficiencies for $t$ distributions with selected degrees of freedom and 
dimensions are listed in Table 1, and for contaminated normal distributions with 
$\epsilon = 0.1$ and selected values of $c$ in Table 2. The efficiencies were derived using 
numerical integration. 

Table 1. ARE($T_{1n}$, $W_n$) at different $p$- and $q$-variate $t$ distributions 
for selected $\nu = \nu_1 = \nu_2$. 

                                        q
                  p        2        3        5        8       10
$\nu = 5$         2    1.089    1.064    1.023    0.986    0.970
                  3             1.039    0.998    0.961    0.946
                  5                      0.958    0.922    0.907
                  8                               0.886    0.871
                 10                                        0.857
$\nu = \infty$    2    0.970    0.960    0.934    0.907    0.893
                  3             0.950    0.925    0.898    0.884
                  5                      0.901    0.874    0.861
                  8                               0.848    0.835
                 10                                        0.823



Table 2. ARE($T_{1n}$, $W_n$) at different $p$- and $q$-variate contaminated 
normal distributions for $\epsilon = 0.1$ and for selected values of $c$. 

                                        q
                  p        2        3        5        8       10
$c = 3$           2    1.216    1.204    1.172    1.137    1.121
                  3             1.192    1.161    1.126    1.109
                  5                      1.130    1.096    1.080
                  8                               1.063    1.048
                 10                                        1.034
$c = 6$           2    1.833    1.815    1.767    1.714    1.689
                  3             1.797    1.749    1.697    1.672
                  5                      1.703    1.652    1.628
                  8                               1.603    1.579
                 10                                        1.556
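The multivariate normal rows of Table 1 ($\nu = \infty$) can be spot-checked with a short numerical integration. The sketch below is only an illustration and rests on stated assumptions: it takes the Wilcoxon efficiency in the form $9(c_1 + c_2)^2/(4pq)$ with $M_1 = M_2^T$, uses $\psi(r) = r$ for normal marginals (so that $c_1 = c_2$), and handles even dimensions only, where the cdf of the radius has a closed form. All function names are hypothetical.

```python
import math


def chi_pdf(r, p):
    """Density of the radius r = ||x|| for x ~ N_p(0, I_p) (chi distribution)."""
    return r ** (p - 1) * math.exp(-r * r / 2) / (2 ** (p / 2 - 1) * math.gamma(p / 2))


def chi_cdf_even(r, p):
    """CDF G(r) of the radius for even p, via the closed-form Poisson sum."""
    h = r * r / 2
    return 1.0 - math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(p // 2))


def e_rank_radius(p, upper=14.0, steps=4000):
    """E[G(r) r] under N_p(0, I_p), by composite Simpson integration on [0, upper]."""
    width = upper / steps
    total = 0.0
    for i in range(steps + 1):
        r = i * width
        # Simpson weights: 1 at the end points, 4 at odd nodes, 2 at even nodes.
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * chi_cdf_even(r, p) * r * chi_pdf(r, p)
    return total * width / 3


def are_wilcoxon_normal(p, q):
    """Assumed form 9 (c1 + c2)^2 / (4 p q); with psi(r) = r we have c1 = c2."""
    c = e_rank_radius(p) * e_rank_radius(q)
    return 9 * (2 * c) ** 2 / (4 * p * q)


if __name__ == "__main__":
    for p, q in [(2, 2), (2, 10), (10, 10)]:
        print(p, q, round(are_wilcoxon_normal(p, q), 3))
```

Odd dimensions would need a general chi-square cdf (for instance from a numerical library), and the $t$ and contaminated normal entries would need the corresponding $\psi$ and radius distributions.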



Further, the limiting efficiency of the van der Waerden test $T_{2n}$ as compared 
to $W_n$ is 

\[ \mathrm{ARE}(T_{2n}, W_n) = \frac{(c_1 + c_2)^2}{4p^2q^2}, \]

where 

\[ c_1 = E\{[\Psi_p^{-1}(G_1(r_i^{(1)}))]^{1/2}\,\psi_1(r_i^{(1)})\}\; E\{[\Psi_q^{-1}(G_2(r_i^{(2)}))]^{1/2}\, r_i^{(2)}\} \]

and 

\[ c_2 = E\{[\Psi_q^{-1}(G_2(r_i^{(2)}))]^{1/2}\,\psi_2(r_i^{(2)})\}\; E\{[\Psi_p^{-1}(G_1(r_i^{(1)}))]^{1/2}\, r_i^{(1)}\}. \]

Now for the multivariate normal distribution, $\mathrm{ARE}(T_{2n}, W_n) = 1$, and for the 
contaminated normal distribution, $\mathrm{ARE}(T_{2n}, W_n) = (1 - \epsilon + \epsilon/c)^2 (1 - \epsilon + \epsilon c)^2$. 
These efficiencies do not depend on the dimensions at all. For the efficiencies 
at certain contaminated normal distributions, see Figure 1. The efficiencies for the $t$ 
distribution with 5 degrees of freedom were derived using numerical integration 
and are listed in Table 3. 




Figure 1. ARE($T_{2n}$, $W_n$) as a function of $c$ at the contaminated 
normal model with $\epsilon = 0, 0.05, 0.10, 0.20$. 

Table 3. ARE($T_{2n}$, $W_n$) at different $p$- and $q$-variate $t$ distributions 
with $\nu_1 = \nu_2 = 5$. 




Now some comments follow. First of all, the limiting efficiencies of the Wilcoxon 
test $T_{1n}$ decrease with increasing dimension, while the efficiencies of the sign 
test $T_{0n}$ and the van der Waerden test $T_{2n}$ increase or stay constant. Due to this 
property, for low dimensions the efficiencies of $T_{1n}$ are higher than those of $T_{0n}$, 
but for high dimensions $T_{0n}$ outperforms $T_{1n}$. The van der Waerden scores test 
is the most efficient one in all considered cases. When the underlying distribution 
is multivariate normal, it is as efficient as Wilks' test. When the distribution 
becomes heavy-tailed, its efficiencies are higher than those of $T_{0n}$ and $T_{1n}$ (for 
the contaminated normal distribution with $\epsilon = 0.1$ and $c = 3$ and $c = 6$, the 
efficiencies of $T_{2n}$ are 1.254 and 1.891). For comparisons of limiting efficiencies, see 
also Figures 2 and 3. 
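The two efficiencies quoted for $T_{2n}$ can be reproduced from the closed-form expression for the contaminated normal model. The snippet below is a minimal check, assuming the expression reads $(1 - \epsilon + \epsilon/c)^2 (1 - \epsilon + \epsilon c)^2$ (the squared exponents are a reconstruction that matches the quoted values); the function name is hypothetical.

```python
def are_vdw_contaminated_normal(eps, c):
    """Assumed closed form of ARE(T_2n, W_n) at the contaminated normal model."""
    return (1 - eps + eps / c) ** 2 * (1 - eps + eps * c) ** 2


if __name__ == "__main__":
    for c in (3, 6):
        # Prints 1.254 for c = 3 and 1.891 for c = 6 when eps = 0.1.
        print(c, round(are_vdw_contaminated_normal(0.1, c), 3))
```

With $\epsilon = 0$ the expression reduces to 1, in line with the multivariate normal case.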

4.2. A Simulation Study 

A simple simulation study was used to compare the finite-sample efficiencies of 
$W_n$, $T_{0n}$, $T_{1n}$ and $T_{2n}$. In total, 1500 independent $x^{(1)}$- and $x^{(2)}$-samples of 
sizes $n = 50$ and 200 were generated from a multivariate standard normal distribution, 
from a $t$ distribution with 5 degrees of freedom and from a contaminated normal 
distribution with $\epsilon = 0.1$ and $c = 6$. The transformation in (3.1) with 
$M_1 = M_2^T = I$ was applied for chosen values of $\Delta = \delta/\sqrt{n}$ to introduce dependence 
into the model. The tests were applied using the location and shape estimates chosen to satisfy 

\[ \mathrm{ave}\{S_i^{(1)}\} = 0 \quad\text{and}\quad p\,\mathrm{ave}\{S_i^{(1)} S_i^{(1)T}\} = I_p \]

and 

\[ \mathrm{ave}\{S_i^{(2)}\} = 0 \quad\text{and}\quad q\,\mathrm{ave}\{S_i^{(2)} S_i^{(2)T}\} = I_q, \]

that is, the transformation-retransformation spatial median and Tyler's M-estimate 
(Tyler, 1987; Hettmansperger and Randles, 2002). For the transformation-retransformation 
technique, see also Chakraborty et al. (1998). The critical values 
used in test constructions were based on the chi-square approximations to the null 
distributions. 
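The data-generating step of such a study can be sketched as follows for the normal case. This is only an illustration under stated assumptions: the dependence transformation is taken to be $y^{(1)} = (1-\Delta)x^{(1)} + \Delta M_1 x^{(2)}$, $y^{(2)} = \Delta M_2 x^{(1)} + (1-\Delta)x^{(2)}$ with $\Delta = \delta/\sqrt{n}$, both matrices are the identity (so $p = q$), and all function names are hypothetical. The tests themselves are not implemented here.

```python
import math
import random


def mat_vec(m, v):
    """Product of a matrix (list of rows) with a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def identity(k):
    """k x k identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


def transform_31(x1, x2, delta, n, m1, m2):
    """Mix two independent vectors with weight Delta = delta / sqrt(n)."""
    d = delta / math.sqrt(n)
    y1 = [(1 - d) * a + d * b for a, b in zip(x1, mat_vec(m1, x2))]
    y2 = [d * a + (1 - d) * b for a, b in zip(mat_vec(m2, x1), x2)]
    return y1, y2


def simulate(n, p, delta, seed=0):
    """n dependent pairs (y1, y2) built from independent standard normal p-vectors."""
    rng = random.Random(seed)
    eye = identity(p)
    sample = []
    for _ in range(n):
        x1 = [rng.gauss(0, 1) for _ in range(p)]
        x2 = [rng.gauss(0, 1) for _ in range(p)]
        sample.append(transform_31(x1, x2, delta, n, eye, eye))
    return sample
```

Setting `delta = 0` leaves the two samples independent, which corresponds to the null hypothesis.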

In Figure 2, the empirical powers as well as the exact limiting powers ($n = \infty$) 
computed using Theorem 3.3 are given for $p = q = 3$. In the multivariate 
normal case $W_n$ is slightly better than $T_{1n}$ and $T_{2n}$ and much better than $T_{0n}$. 
In the $t$ distribution case no big differences can be seen between the tests, and in the 
contaminated normal case $T_{1n}$ and $T_{2n}$ outperform $W_n$ and $T_{0n}$. In Figure 3, the 
empirical powers are illustrated for $p = q = 8$. In the multivariate normal case $T_{0n}$ 
and $T_{2n}$ are slightly more powerful than $T_{1n}$. In the considered $t$ distribution case 
$T_{1n}$ performs poorly, but when the underlying distribution is contaminated normal, 
$T_{1n}$ performs very well. For $p = q = 8$, the sizes of $T_{0n}$ and $T_{2n}$ are often slightly 
below 0.05. The size of $T_{1n}$ is very close to 0.05 in all cases, and for heavy-tailed 
distributions the size of $W_n$ often exceeds 0.05. 



5. A Robustness Study and Final Comments 

Finally, a simple simulation study was used to illustrate the robustness of the test 
statistics proposed above. Independent $x^{(1)}$- and $x^{(2)}$-samples of size $n = 30$ were 
generated from a bivariate standard normal distribution, and the transformation 
in (3.1) with $M_1 = M_2^T = I_2$ was applied for chosen values of $\Delta$ to introduce 
“positive” dependence into the model. By positive dependence we mean that each 
coordinate of the first vector is positively dependent on each coordinate of the second 
vector. Finally, the first observation vectors in each sample were replaced by the 
contaminated vectors $(c, c)^T$ and $(-c, -c)^T$, respectively, which exhibit “negative” 
dependence. The procedure was repeated 1000 times and mean p-values were computed. 

Figure 3. Empirical powers for $p = q = 8$ using the multivariate 
normal distribution (first row), multivariate $t$ distribution with 
$\nu = 5$ (second row) and contaminated normal distribution with 
$\epsilon = 0.1$ and $c = 6$ (third row). The thick solid line denotes $W_n$, the 
thin solid line $T_{1n}$, the thick dotted line $T_{2n}$ and the thin dotted 
line $T_{0n}$. 

In Figure 4, the mean p-values are illustrated as a function of the contamination 
value $c$ for $\Delta = 0$ and for $\Delta = 0.2$. In the null hypothesis case ($\Delta = 0$), all tests give 
p-values close to 0.5 when the contamination value is near zero. Note also that $T_{0n}$ 
and $T_{1n}$ give practically the same p-values as Wilks' test. When the contamination 
value is high, the p-values given by Wilks' test decrease considerably, and some decrease 
is also seen in the p-values of $T_{1n}$ and $T_{2n}$. In the considered case, the sign test 
$T_{0n}$ seems to be the most robust one, since its mean p-value is constant as a 
function of $c$. When $\Delta = 0.2$, the contamination slightly increases the mean p-values 
of the rank scores tests. In the case of Wilks' test the p-values first increase and then 
decrease to zero with the contamination value. A careful analysis shows, however, that 
the small p-values for large contamination values erroneously indicate “negative” 
dependence. 



Figure 4. Mean p-values for the true null hypothesis $H_0: \Delta = 0$ 
(left figure) and for the alternative hypothesis $H_1: \Delta = 0.2$ (right 
figure) as a function of the contamination value, as described in the 
text. The thick solid line refers to $W_n$, the thin solid line to $T_{1n}$, 
the thick dotted line to $T_{2n}$ and the thin dotted line to $T_{0n}$. 



In the paper, new affine invariant rank scores procedures were proposed for 
testing whether two random vectors are independent. The test statistics were 
constructed using standardized spatial signs and ranks of the lengths of the 
standardized vectors. It is remarkable that the proposed tests are valid without any 
moment assumptions on the underlying data, as long as the standardization is done 
using location vector and scatter (or shape) matrix estimates that do not themselves 
require any moment assumptions. In the paper, three different score functions, 
namely the sign scores, the Wilcoxon scores and the van der Waerden scores, were 
considered in more detail. The tests have good limiting and finite-sample efficiencies 
and, as illustrated by an example, the tests are resistant to outliers. 

Appendix: Some Notions on Alternative Sequences 



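For all elliptic cases, it is enough to consider the alternative sequences 

\[
\begin{pmatrix} y_i^{(1)} \\ y_i^{(2)} \end{pmatrix}
= \begin{pmatrix} (1-\Delta) I_p & \Delta M_1 \\ \Delta M_2 & (1-\Delta) I_q \end{pmatrix}
\begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \end{pmatrix},
\]

where $\Delta = \delta/\sqrt{n}$ and $x_i^{(1)}$ and $x_i^{(2)}$ are independent with spherical marginal 
distributions. This is because, for the weighted sum of elliptical marginals, 

\[
\begin{aligned}
\begin{pmatrix} (1-\Delta) I_p & \Delta M_1 \\ \Delta M_2 & (1-\Delta) I_q \end{pmatrix}
\begin{pmatrix} A x_i^{(1)} \\ B x_i^{(2)} \end{pmatrix}
&= \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
\begin{pmatrix} (1-\Delta) I_p & \Delta A^{-1} M_1 B \\ \Delta B^{-1} M_2 A & (1-\Delta) I_q \end{pmatrix}
\begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \end{pmatrix} \\
&= \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
\begin{pmatrix} (1-\Delta) I_p & \Delta M_1' \\ \Delta M_2' & (1-\Delta) I_q \end{pmatrix}
\begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \end{pmatrix}.
\end{aligned}
\]

Hence (due to affine invariance) one can as well consider the sequence 

\[
\begin{pmatrix} (1-\Delta) I_p & \Delta M_1' \\ \Delta M_2' & (1-\Delta) I_q \end{pmatrix}
\begin{pmatrix} x_i^{(1)} \\ x_i^{(2)} \end{pmatrix}
\]

for different choices of $M_1'$ and $M_2'$. In all cases, the efficiencies are then of the 
same type $\|c_1 M_1 + c_2 M_2^T\|^2$, where $c_1$ and $c_2$ depend on the marginal spherical 
distributions and the test used. If the marginal distributions are of the same type 
(that is, $c_1 = c_2$ for all tests to be compared), then the efficiencies do not depend on 
$M_1$ and $M_2$. Note also that the tests are not “unbiased” (noncentrality parameter 
$> 0$) for all alternative sequences. They are all unbiased for normal marginals. 
If the marginals are nonnormal but of the same type, then they all fail under 
$M_1 = -M_2^T$. 

The Influence of a Stochastic Interest Rate 
on the n-fold Compound Option 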

L. Thomassen and M. Van Wouwe 



Abstract. We reintroduce the idea of an n-fold compound option as a generalization 
of Geske's (2-fold) compound option in the same framework of 
constant interest rates. For the valuation of long-term financial agreements 
(life insurance products) this assumption is not always realistic, so that 
stochastic modelling of the interest rates might be a better approach. 
Following Miltersen et al. (1997), we require the simple interest rates over a 
fixed finite period to be log-normally distributed. Under these 
assumptions, closed-form solutions are determined for n-fold compound 
call options written on zero-coupon bonds. 

A numerical illustration of the application of robust methods to interest rates 
is discussed. 

Mathematics Subject Classification (2000). Primary 91B28; Secondary 62P05. 
Keywords. Financial, n-fold compound options, log-normal interest rates. 



1. Introduction 

In constructing an n-fold compound call option model, Thomassen and Van Wouwe 
(2002) used the same economic assumptions as Geske (1979) did when introducing 
his (2-fold) compound option. One of the economic hypotheses is the assumption 
of a deterministic risk-free interest rate both for the original compound call option 
and the generalized compound call option. 

If, however, one looks at possible application fields of the n-fold compound 
call option, it appears that long-term commitments are involved. Indeed, the 
introduction of, for instance, new life insurance contracts is a long-term process. So 
one can start questioning the validity of a constant interest rate. Life insurance 
practice shows that the currently used approaches to represent interest rates in 
actuarial models may be broadly categorized as deterministic and stochastic. 

The most familiar deterministic approach is still a single interest rate with 
a slight generalization to a series of interest rates for future years. In recent years 
one also sees the development of a tremendous profusion of models for the valuation 
of interest rate sensitive financial instruments. Wall Street dealers routinely use a 
multiplicity of models based on widely varying assumptions in different markets. 
For example, an options desk most likely uses a version of the Black formula to 
value interest rate caps and floors, implying an approximately log-normal 
distribution of the interest rates. 

A critical issue in selecting an interest rate model is ease of application. It 
is indeed difficult or impossible to provide efficient valuation algorithms for some 
models. For that reason we focused on the literature to find closed-form solutions 
for term structure derivatives with log-normal interest rates. We were inspired by 
the reasoning developed by Miltersen et al. (1997). 

They derived a unified model that gives closed-form solutions for caps and 
floors written on interest rates as well as for puts and calls written on zero-coupon 
bonds. Their crucial assumption is that the simple interest rates over a fixed finite 
period are log-normally distributed, an assumption that is shown to be consistent 
with the Heath-Jarrow-Morton model for a specific choice of volatility. As they 
pointed out, log-normally distributed interest rate models avoid the problem of 
negative interest rates, but the rates explode with positive probability, implying zero 
prices for bonds and hence allowing arbitrage opportunities. 

As shown in Sandmann and Sondermann (1994), the problem of exploding interest 
rates disappears if, instead of assuming the continuously compounded interest 
rates to be log-normally distributed, one assumes the simple interest rates 
over a fixed finite period to be log-normally distributed. The main result of their 
article is a unified model that provides closed-form solutions for interest rate caps 
and floors as well as for puts and calls written on zero-coupon bonds within the 
context of a log-normal interest rate model. 

The closed-form valuation is a Black-Scholes-like formula, which we generalize 
to n-fold compound call options written on zero-coupon bonds in the 
same framework as Miltersen et al. (1997). 



2. A Model for Simple Forward Rates over a Fixed Period 



Let $f(t, T, \alpha)$ denote the simple forward rate over a fixed period of length 
$\alpha$ prevailing at date $t$ for the future time interval $[T, T+\alpha]$. Following Miltersen 
et al. (1997), the simple forward rates will be modelled as log-normal diffusions: 

\[ df(\cdot, T, \alpha)_t = \mu(t, T, \alpha)\, f(t, T, \alpha)\, dt + \gamma(t, T, \alpha)\, f(t, T, \alpha)\, dW_t. \]

Let $P(t, T)$ denote the price at time $t$ of a default-free zero-coupon bond that 
pays 1 at maturity date $T$. It then follows that 

\[ P(t, T+\alpha) = P(t, T)\, \frac{1}{1 + \alpha f(t, T, \alpha)}. \]



If the forward price $F(t, T, \alpha)$ of the contract is defined as the fixed price 
which the buyer agrees to pay at date $T$ for the bond with maturity $T + \alpha$, such 
that the value of the forward contract at date $t$ is zero, then by no-arbitrage 
arguments the forward price is given by 

\[ F(t, T, \alpha) = \frac{P(t, T+\alpha)}{P(t, T)}. \]
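The two relations of this section link bond prices, the simple forward rate and the forward price. The sketch below only illustrates them, under the assumption that they read $P(t, T+\alpha) = P(t, T)/(1 + \alpha f(t, T, \alpha))$ and $F(t, T, \alpha) = P(t, T+\alpha)/P(t, T)$; the function names and the sample numbers are hypothetical.

```python
def extended_bond_price(p_t_T, alpha, fwd_rate):
    """P(t, T + alpha) implied by P(t, T) and the simple forward rate f(t, T, alpha)."""
    return p_t_T / (1 + alpha * fwd_rate)


def forward_price(p_t_T, p_t_T_alpha):
    """Forward price F(t, T, alpha) of the bond maturing at T + alpha."""
    return p_t_T_alpha / p_t_T


def implied_simple_forward_rate(p_t_T, p_t_T_alpha, alpha):
    """Simple forward rate recovered from the two bond prices."""
    return (p_t_T / p_t_T_alpha - 1) / alpha


if __name__ == "__main__":
    p_short = 0.95                      # hypothetical P(t, T)
    p_long = extended_bond_price(p_short, 0.25, 0.04)
    print(p_long, forward_price(p_short, p_long))
    print(implied_simple_forward_rate(p_short, p_long, 0.25))
```

Under these relations the forward price equals $1/(1 + \alpha f)$, so a positive simple forward rate always gives a forward price strictly between 0 and 1.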



3. A Closed-form Solution for the n-fold Compound Call Option 

We reintroduce the notion of an n-fold compound call option: 

Definition 3.1. By induction: 

A compound call option (of order 2) is a call option on a call option. This can be 
generalized to a compound call of order n (with exercise date and price given by 
ti and Ki) with an underlying call of order n — 1. 

Notations: 

$V$ : current market value of the firm, 

$t_i$ : maturity date of investment for the compound call option $C_i$, 

$K_i$ : exercise price for the compound call option $C_i$, 

$C_i$ : current value of the compound call option on the option $C_{i+1}$, 

$r$ : risk-free rate of interest, 

$\sigma^2$ : instantaneous variance of the return on the assets of the firm, 

$N_n(a_1, a_2, \ldots, a_n; \Gamma)$ : n-variate cumulative normal distribution 
function with $a_1, a_2, \ldots, a_n$ as upper limits 
and $\Gamma$ as the correlation matrix. 

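Theorem 3.2. Suppose for $s = n, n-1, \ldots, 2$ the calls $C_s$ of order $n + 1 - s$ are 
known and given by: 

\[
C_s = V\, N_{n+1-s}(a_s, a_{s+1}, \ldots, a_n; \Gamma^s_{n+1-s})
 - \sum_{m=s}^{n} K_m e^{-r(t_m - t)}\, N_{m+1-s}(b_s, b_{s+1}, \ldots, b_m; \Gamma^s_{m+1-s}),
\]

with the notations 

\[
\begin{aligned}
a_\ell &= b_\ell + \sigma\sqrt{t_\ell - t}, & \ell &= 1, 2, \ldots, n, \\
b_\ell &= \frac{\ln(V/\bar V_\ell) + (r - \sigma^2/2)(t_\ell - t)}{\sigma\sqrt{t_\ell - t}}, & \ell &= 1, 2, \ldots, n, \\
\bar V_\ell &= \text{solution of the equation } C_{\ell+1}(V, t_\ell) = K_\ell, & \ell &= 1, 2, \ldots, n-1, \\
\bar V_n &= K_n, \\
\rho_{ij} &= \sqrt{\frac{t_i - t}{t_j - t}}, \\
\Gamma^s_i &: \quad \gamma_{ii} = 1, \quad \gamma_{ij} = \rho_{i+s-1,\, j+s-1}.
\end{aligned}
\]

Then the n-fold compound call option can be found: 

\[
C_1 = V\, N_n(a_1, a_2, \ldots, a_n; \Gamma^1_n)
 - \sum_{m=1}^{n} K_m e^{-r(t_m - t)}\, N_m(b_1, b_2, \ldots, b_m; \Gamma^1_m).
\]

For a proof the reader is referred to the article by Thomassen and Van Wouwe 
(2002). 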

We now make a particular choice for the underlying by taking a forward $F$ instead 
of the current market value of the firm $V$. This implies that we will work with the 
forward price $\tilde C$ of a call option $C$. 

With the adapted definition of an n-fold compound call option we are able 
to value the forward price of an n-fold compound call option. The 
valuation is proved by induction in the following theorem: 

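Theorem 3.3. The following notations are introduced: 

\[
\begin{aligned}
\alpha_i &= t_{i+1} - t_i, & \forall i &= 1, \ldots, n, \\
s_i &= \int_t^{t_i} \gamma^2(\tau, t_n, \alpha_n)\, d\tau, & \forall i &= 1, \ldots, n, \\
\bar F_i &\text{ is the solution of } P(t_i, t_{i+1}) \cdot \tilde C_{i+1}(t_i, F) = K_i, & \forall i &= 1, \ldots, n-1, \\
\bar F_n &= K_n, \\
a_i &= b_i + \sqrt{s_i}, & \forall i &= 1, \ldots, n, \\
b_i &= \frac{1}{\sqrt{s_i}} \left[ \ln \frac{F (1 - \bar F_i)}{\bar F_i (1 - F)} - \frac{s_i}{2} \right], & \forall i &= 1, \ldots, n,
\end{aligned}
\]

where $F$ is used as the abbreviated notation for the forward price $F(t, t_n, \alpha_n)$. 
Suppose that for all $v = n, n-1, \ldots, 3, 2$ the value of the $(n-v+1)$-fold compound 
call is known and given by 

\[
\begin{aligned}
\tilde C_v ={}& \prod_{i=v}^{n-1} P(t_i, t_{i+1}) \cdot F \cdot N_{n-v+1}(a_v, a_{v+1}, \ldots, a_n; G^v_n) \\
 &- \sum_{i=v}^{n} K_i \cdot F \cdot N_{i-v+1}(a_v, a_{v+1}, \ldots, a_i; G^v_i) \cdot \prod_{j=v}^{i-1} P(t_j, t_{j+1}) \\
 &- \sum_{i=v}^{n} K_i \cdot (1 - F) \cdot N_{i-v+1}(b_v, b_{v+1}, \ldots, b_i; G^v_i) \cdot \prod_{j=v}^{i-1} P(t_j, t_{j+1}),
\end{aligned}
\]

then the value of the n-fold compound call is determined by the following 
closed-form solution: 

\[
\begin{aligned}
\tilde C_1 ={}& \prod_{i=1}^{n-1} P(t_i, t_{i+1}) \cdot F \cdot N_n(a_1, a_2, \ldots, a_n; G^1_n) \\
 &- \sum_{i=1}^{n} K_i \cdot F \cdot N_i(a_1, a_2, \ldots, a_i; G^1_i) \cdot \prod_{j=1}^{i-1} P(t_j, t_{j+1}) \\
 &- \sum_{i=1}^{n} K_i \cdot (1 - F) \cdot N_i(b_1, b_2, \ldots, b_i; G^1_i) \cdot \prod_{j=1}^{i-1} P(t_j, t_{j+1}),
\end{aligned}
\]

where the correlation matrices are defined by the structure 

\[
G^x_y = (g_{ab})_{a,b = 1, 2, \ldots, y-x+1} \quad\text{with}\quad
g_{aa} = 1, \qquad g_{ab} = g_{ba} = \sqrt{\frac{s_{a+x-1}}{s_{b+x-1}}} \ \text{ if } a < b
\]

(where $\prod$ equals 1 if the upper limit is smaller than the lower limit and where $\sum$ 
equals 0 if the upper limit is smaller than the lower limit). 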



Proof. $n = 1$: This result was obtained by Miltersen et al. (1997). 

$n > 1$: Suppose that for all $v = n, n-1, \ldots, 3, 2$ the value of the $(n-v+1)$-fold 
compound call $\tilde C_v$ is given by 

\[
\begin{aligned}
\tilde C_v ={}& \prod_{i=v}^{n-1} P(t_i, t_{i+1}) \cdot F \cdot N_{n-v+1}(a_v, a_{v+1}, \ldots, a_n; G^v_n) \\
 &- \sum_{i=v}^{n} K_i \cdot F \cdot N_{i-v+1}(a_v, a_{v+1}, \ldots, a_i; G^v_i) \cdot \prod_{j=v}^{i-1} P(t_j, t_{j+1}) \\
 &- \sum_{i=v}^{n} K_i \cdot (1 - F) \cdot N_{i-v+1}(b_v, b_{v+1}, \ldots, b_i; G^v_i) \cdot \prod_{j=v}^{i-1} P(t_j, t_{j+1}).
\end{aligned}
\]

For the n-fold compound call $C_1$ we construct the following hedge: 

\[ C_1 = \psi_t^1 \cdot P(t, t_1) \cdot F(t, t_n, \alpha_n) + \psi_t^2 \cdot P(t, t_1), \]

so that the forward value fulfils 

\[ \tilde C_1 = \psi_t^1 \cdot F(t, t_n, \alpha_n) + \psi_t^2. \]

Using Itô calculus on the one hand and the fact that the portfolio is self-financing 
on the other, we obtain the PDE for the forward price of the compound call option: 

\[
\frac{\partial \tilde C_1}{\partial t}
 + \frac{1}{2}\, \gamma^2(t, t_n, \alpha_n) \cdot (1 - F(t, t_n, \alpha_n))^2 \cdot F^2(t, t_n, \alpha_n) \cdot \frac{\partial^2 \tilde C_1}{\partial F^2} = 0.
\tag{3.1}
\]

By analogous substitutions as in Miltersen et al. (1997): 

\[
\begin{aligned}
t \to s_1 &= \int_t^{t_1} \gamma^2(\tau, t_n, \alpha_n)\, d\tau, \\
F \to z &= \ln \frac{F}{1 - F} \quad (\text{suppose } F < 1), \text{ so } F = \frac{1}{1 + \exp(-z)}, \\
g(s_1, z) &= \tilde C_1(t, F), \\
a(z) &= \frac{1}{\exp(z/2) + \exp(-z/2)}, \\
b(s_1) &= \exp(-s_1/8), \\
h(s_1, z) &= \frac{g(s_1, z)}{a(z) \cdot b(s_1)},
\end{aligned}
\]

the above equation (3.1) is rewritten as a diffusion equation $\frac{1}{2} h_{zz} - h_{s_1} = 0$ and 
solved using a Green's function. 

After some calculations, the solution of the diffusion equation is given by: 



n—i 

Hsi,z) = ]][ P(ti,ti+i) exp y + 2 



with the notation 



and with 



|ln| 

n i—1 






i=l 7=1 

p-^OO 

/f , P , 1 ■■■,«: 

J( InbrAr- 



,a*\Gf)<t>{u)du 



|l-n| 2 

n z— 1 









Z=1 j=l 
r+oo 



[in 2_K I -2^+ s I / \/^ 



,b*-,Gf)4,{w)dw 



1 r 



Gf = {9ab)a,b=l,2,- , 3-1 with 



Qaa — 1 



9ab — Qba 



if a < b. 
So the final boundaries and correlations are given by the following expressions: 
s* = Sj{ti)= 

Jti 



Observing that 
we obtain 



Si T Sj — s j , 



bj 




+ In 



Fj[^-F] 



2 



} 



V^w + ln 



F[l-Fj] 

Fj[^-F] 




Using the appendix the final equation writes: 



n—1 

h{si,z) = H 

i=l 

n i—1 

i=l j=l 
n i—1 

-'E^U P{tj,tj+i)e^^/^e-^/^Ni{h,b2, Mi) 

1=1 j=l 

with correlation matrices 



■^j — (j^uv^u,v=i, 2 ,...,j with 



— I 

'^^uv — '^vu 



^1 + 

Si + Sj 




if u < i; 



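or in other words: $M_j = G^1_j$. 

Returning to the value of the forward n-fold compound call option we conclude: 

\[
\begin{aligned}
\tilde C_1(t, F) ={}& \prod_{i=1}^{n-1} P(t_i, t_{i+1}) \cdot F \cdot N_n(a_1, \ldots, a_n; G^1_n) \\
 &- \sum_{i=1}^{n} K_i \cdot \prod_{j=1}^{i-1} P(t_j, t_{j+1}) \cdot F \cdot N_i(a_1, \ldots, a_i; G^1_i) \\
 &- \sum_{i=1}^{n} K_i \cdot \prod_{j=1}^{i-1} P(t_j, t_{j+1}) \cdot [1 - F] \cdot N_i(b_1, \ldots, b_i; G^1_i).
\end{aligned}
\]

For a detailed proof, the reader is referred to Thomassen and Van Wouwe (2003). 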

□ 

4. Robust Parameter Estimate: the Influence on the Numerical 
Calculation of the n-fold Compound Option 

For the numerical estimate of the n-fold compound option we return to the 
situation of a constant interest rate model, because no data on forwards are available. 
Under this constraint it was proved in Thomassen and Van Wouwe (2002) that 
the value of the n-fold compound option is given by Theorem 3.2. 

The problem of numerical evaluation essentially reduces to carefully 
choosing the estimates for the risk-free interest rate and the volatility of the 
underlying asset. Thomassen and Van Wouwe (2002) analyzed the sensitivity of the 
n-fold compound option to these two variables and found the 
following expressions 



dCi 

da 



1 



n— 1 



exp 






Vj+i exp [- r{tj+i - i)] 



I “ Pi J+l ^i+l h ~ ^J+l „* . ryni 

■lSn-1 I / ■ 



with the boundaries ^*^ 2 ? ‘ taken at time t = tj^i and in the value V = Vj-\-i , 



\[
\frac{\partial C_1}{\partial r} = \sum_{m=1}^{n} K_m (t_m - t)\, e^{-r(t_m - t)}\, N_m(b_1, b_2, \ldots, b_m; \Gamma^1_m),
\]

revealing a positive sensitivity of the compound call $C_1$ towards $r$ and $\sigma$. 




Figure 1. Interest rates. 

For a particular data set of interest rates (source: Datastream, interest 
rates on three-month Belgian Treasury bonds, see Figure 1) the classical average 
equals 0.0441. Among robust alternatives, the MCD location estimator (see 
Rousseeuw and Van Driessen, 1999) may be considered; it provides an estimate of 
0.0447 for the same data set. Since we want a ‘safe’ value for the n-fold compound 
option from the point of view of the buyer, and the compound option is positively 
sensitive to the interest rate, the robust estimate is to be preferred. 

Choosing the volatility adequately is more problematic. The chosen data set 
(daily volatility of Solvay over the last 5 years) does not obey the 
elliptic condition. The application of the MCD location estimator reveals the 
presence of a number of outliers. In our search for robust methods that make no 
assumptions on the distribution, we meet the Hodges-Lehmann location estimator 
(Hodges and Lehmann, 1963). An overview of the different estimates gives: 

    classical average:   0.0003029498 
    MCD:                 0.000266752 
    Hodges-Lehmann:      0.000286128. 
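For readers unfamiliar with the Hodges-Lehmann location estimator, a minimal sketch is given below. It uses the common definition as the median of all pairwise (Walsh) averages, including each observation paired with itself; some authors exclude the self-pairs, so this convention is an assumption, as are the function name and the toy data. The Solvay data are not reproduced here.

```python
from itertools import combinations_with_replacement
from statistics import mean, median


def hodges_lehmann(values):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs with i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


if __name__ == "__main__":
    data = [1, 2, 3, 100]               # toy data with one gross outlier
    print(mean(data), hodges_lehmann(data))
```

On the toy data the outlier pulls the mean far away from the bulk of the observations, while the Hodges-Lehmann estimate stays close to it. The number of Walsh averages grows quadratically with the sample size, which is fine for a few thousand daily observations but not for much larger series.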

We are confronted with the fact that the MCD location estimator does not 
provide an estimate for the volatility as expected. The next table shows how 
the different estimates for the interest rate and the volatility influence 
the value of the n-fold compound option. We assume a 4-fold 
compound call option with 

• risk free rate of interest ln(l -h j^) 

• volatility $\sigma$ 

• exercise prices 

Ki = 100 (1 - e"0, K2 = 100 (1 - e'O, A'a = 100 (1 - e“0, = 100 

• maturity dates: 

$t_1 = 21$, $t_2 = 22$, $t_3 = 23$, $t_4 = 24$ 

• present time $t_0 = 0$ 

• present value at time $t_0$: $V = 100$. 

The table shows a strong dependence of the 4-fold compound call on changes in the 
interest rate. The changes in the volatility are, however, too small to produce 
differences in the value of the 4-fold compound option. 

Table 1. 4-fold compound call evaluated for different estimates 
of interest rates and volatilities. 

    volatility estimate                 r = 0.0441    r = 0.0447
    0.0003029498 (classical)               59.5968       60.0813
    0.000266758  (MCD)                     59.5968       60.0813
    0.000286128  (Hodges-Lehmann)          59.5968       60.0813

We observe from the numerical evaluation of the 4-fold compound option 
that the robust estimator for the interest rates can be a useful tool for a safer 
valuation. The differences between the different estimates for the volatility are too 
small to express a preference between these estimators. 



5. Conclusion 

In considering a log-normal model for the interest rate a closed-form generalization 
of the n-fold compound option is introduced and proved. 

The numerical evaluation of the n-fold compound option shows that the robust 
estimators can be used for ‘safe’ valuations of the n-fold compound option. 



Appendix 

To integrate a product of an exponential and an n- variate normal CDF, we need 
the following theorem: 



Theorem A.l 









AT ( f 9 + ^2 fq + bi 



92 



= Ni{h, 



^2 



’x/7^’ 

with correlation matrices 

R — {'^ab^a,b=l,2,...,i—l 

( Taa=l 

such that < ^ ^ 9a-\-i 

1 Tab = Ha ^ 

V Qb^l 

H — {hab^a,b=l,2,...,i 

{ haa — I 
such that < 



;H 



9i 



; R dq 



for a < b 



hab — hl)a — 



lf + 9l 



and where for convenience 



for a <b 



Proof. The proof can be found in Corollary A.1 of a paper by Thomassen and Van 
Wouwe (2002). 

Robust Estimations for Multivariate 
$\sinh^{-1}$-Normal Distribution 

Abstract. In this paper we construct estimators for the parameters of the 
multivariate $\sinh^{-1}$-normal distribution. We discuss some properties of these 
estimators, such as consistency, B-robustness and asymptotic normality. 

1. Introduction 

The problem of robust estimation for the parameters of multivariate 
distributions has been studied especially in the case of elliptical distributions, and 
particularly for the multivariate normal distribution. In this paper we consider the 
multivariate $\sinh^{-1}$-normal distribution. This distribution is generated from the normal 
distribution by using the inverse hyperbolic sine normal transformation suggested 
by Johnson (1949), who thereby generated the system of distributions known as the 
Johnson translation system. In using this distribution it can be helpful to specify the 
mean vector and the covariance structure. This facet of its use has played an important 
role in discriminant analysis simulation studies (see Johnson, 1987). Several Monte 
Carlo studies in discriminant analysis have used members of the Johnson translation 
system, particularly the multivariate $\sinh^{-1}$-normal distribution, to model the 
distributions governing populations. One of the first papers was by Lachenbruch 
et al. (1973), who studied the estimation of error rates in linear and quadratic 
discrimination under departures from the multivariate normal distribution. Their 
study provided the distributional framework for subsequent Monte Carlo work by 
Koffler and Penfield (1979). One of the specific distributions used from the Johnson 
system was the ten-variate $\sinh^{-1}$-normal distribution. Conover and Iman 
(1980) also used the multivariate $\sinh^{-1}$-normal distribution to compare discriminant 
analysis procedures. The purpose of this paper is to construct robust estimators 
for the parameters of the $p$-variate $\sinh^{-1}$-normal distribution. 

For a given $p$-dimensional $\sinh^{-1}$-normal data set, we define estimators for 
the mean vector and the covariance matrix by applying a $\sinh^{-1}$-transform to 
the data, computing robust estimators for the resulting Gaussian data and 
transforming those estimators back to the $\sinh^{-1}$-normal scale, using the relationship 
between the parameters of both distributions. For every choice of robust estimators in 
the $p$-variate normal case, other robust estimators for the parameters of the 
$p$-variate $\sinh^{-1}$-normal distribution are obtained. The problem of robust estimation 
in the $p$-variate normal case has been widely studied; an overview of 
some existing estimators of multivariate location and scatter matrix is given 
in Maronna and Yohai (1998). 

In the following we consider those estimators with the property that there 
exists a functional $T$ defined on a convex set $D_T$ of distributions and valued in the 
parameter space $\Theta$, such that, if $X_1, X_2, \ldots, X_n$ are i.i.d. random vectors with the 
distribution $G = G_\theta \in D_T$, then 

\[ T_n(X_1, X_2, \ldots, X_n) \to T(G). \]

The notation $G$ is the same for a probability distribution and its corresponding 
cumulative distribution function. The functional $T$ is called Fisher consistent 
if $T(G_\theta) = \theta$ for all $\theta \in \Theta$. 

The influence function of the functional $T$ at $G$ measures the effect of an 
infinitesimal contamination located at a single point $x$. We thus consider the 
contaminated distribution $G_{\epsilon,x} = (1 - \epsilon) G + \epsilon \delta_x$, where $\delta_x$ is the distribution putting 
all its mass at $x$. Then the influence function is defined by (see Hampel et al., 
1986) 

\[ \mathrm{IF}(x; T, G) = \lim_{\epsilon \to 0} \frac{T(G_{\epsilon,x}) - T(G)}{\epsilon}. \]

The gross error sensitivity measures approximately the maximum contribution to 
the estimation error that can be produced by a single outlier and is defined as 
$\sup_x \|\mathrm{IF}(x; T, G)\|$. Whenever the gross error sensitivity is finite, the functional 
$T$ is called B-robust. In that case the estimator $T_n$ associated with the 
functional $T$ is also said to be B-robust. 
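A one-dimensional toy case makes the definition concrete. For the mean functional $T(G) = E_G[X]$ the difference quotient equals $x - T(G)$ for every $\epsilon$, so the influence function is unbounded in $x$ and the mean is not B-robust. The sketch below checks this numerically for a distribution with finite support; it is an illustration only, and all names are hypothetical.

```python
def mean_of(points, weights):
    """Mean of a finite-support distribution G given its atoms and probabilities."""
    return sum(w * x for x, w in zip(points, weights))


def contaminated_mean(points, weights, x, eps):
    """Mean of the contaminated distribution (1 - eps) G + eps delta_x."""
    return (1 - eps) * mean_of(points, weights) + eps * x


def finite_eps_influence(points, weights, x, eps=1e-6):
    """Difference quotient approximating IF(x; T, G) for the mean functional."""
    base = mean_of(points, weights)
    return (contaminated_mean(points, weights, x, eps) - base) / eps


if __name__ == "__main__":
    atoms, probs = [-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3]
    for x in (10.0, 1000.0):
        print(x, finite_eps_influence(atoms, probs, x))
```

The quotient grows linearly with the position of the contaminating point, which is exactly the behaviour a B-robust functional rules out.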

Throughout the paper, $F$ denotes the $p$-variate normal distribution $N_p(\mu, V)$, 
where $\mu \in \mathbb{R}^p$ and $V \in \mathrm{SPD}(p)$, with $\mathrm{SPD}(p)$ the set of all symmetric and positive 
definite matrices. $F'$ is the $p$-variate $\sinh^{-1}$-normal distribution with mean $m$ 
and covariance matrix $S$. We will consider estimators $t_n$ and $C_n$ of $\mu$ and $V$, 
respectively, and will denote by $t$ and $C$ the corresponding statistical functionals. 
We will suppose that $t$ and $C$ are B-robust and Fisher consistent. In particular, we 
will take the case when the functionals $t$ and $C$ are affine equivariant, meaning that 

\[ t(AX + b) = A\, t(X) + b, \]
\[ C(AX + b) = A\, C(X)\, A^T, \]

for any $b \in \mathbb{R}^p$ and any $p \times p$ nonsingular matrix $A$ (here the notation $T(Z)$ instead 
of $T(G)$ means that $Z \sim G$). 

The outline of the paper is as follows. In Section 2, we construct estimators 
for the parameters of the $p$-variate $\sinh^{-1}$-normal distribution and prove their 
consistency. In Section 3 we discuss the B-robustness properties, while in Section 
4 we give some asymptotic results. 



2. The Estimators 

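Let $X = (X^1, \ldots, X^p)^T$ be a random vector with normal distribution $N_p(\mu, V)$ 
and define 

\[ Y = (Y^1, \ldots, Y^p)^T = [\sinh(X^1), \ldots, \sinh(X^p)]^T, \]

where $\sinh(x) = (e^x - e^{-x})/2$. The random vector $Y$ has a $p$-variate 
$\sinh^{-1}$-normal distribution. For notation let 

\[ E(Y^i) = m_i, \]
\[ \mathrm{cov}(Y^i, Y^j) = s_{ij}, \]

for all $i, j = 1, \ldots, p$ and let $\mu_i$ and $v_{ij}$ denote analogous quantities in the $N_p(\mu, V)$ 
distribution. The following relationships hold: 

\[ m_i = e^{v_{ii}/2} \sinh(\mu_i), \tag{2.1} \]

\[ s_{ii} = \frac{1}{2}\,(e^{v_{ii}} - 1)\,\bigl(1 + \cosh(2\mu_i)\, e^{v_{ii}}\bigr), \tag{2.2} \]

\[
s_{ij} = e^{(v_{ii} + v_{jj})/2} \left[ \frac{\cosh(\mu_i + \mu_j)}{2}\, e^{v_{ij}}
 - \frac{\cosh(\mu_i - \mu_j)}{2}\, e^{-v_{ij}} - \sinh(\mu_i) \sinh(\mu_j) \right],
\tag{2.3}
\]

for all $i, j = 1, \ldots, p$, where $\cosh(x) = (e^x + e^{-x})/2$. These formulae can be found 
for instance in Johnson (1987). 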
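The moment relationships can be checked against each other numerically. The sketch below assumes they read $m_i = e^{v_{ii}/2}\sinh(\mu_i)$, $s_{ii} = \frac{1}{2}(e^{v_{ii}} - 1)(1 + \cosh(2\mu_i)e^{v_{ii}})$ and $s_{ij} = e^{(v_{ii}+v_{jj})/2}[\cosh(\mu_i+\mu_j)e^{v_{ij}}/2 - \cosh(\mu_i-\mu_j)e^{-v_{ij}}/2 - \sinh(\mu_i)\sinh(\mu_j)]$, which is what direct integration of $\sinh$ of a normal vector gives; the function names are hypothetical.

```python
import math


def sinh_normal_mean(mu_i, v_ii):
    """Mean of Y^i = sinh(X^i) for X^i ~ N(mu_i, v_ii)."""
    return math.exp(v_ii / 2) * math.sinh(mu_i)


def sinh_normal_var(mu_i, v_ii):
    """Variance of Y^i = sinh(X^i)."""
    return 0.5 * (math.exp(v_ii) - 1) * (1 + math.cosh(2 * mu_i) * math.exp(v_ii))


def sinh_normal_cov(mu_i, mu_j, v_ii, v_jj, v_ij):
    """Covariance of sinh(X^i) and sinh(X^j) for jointly normal (X^i, X^j)."""
    scale = math.exp((v_ii + v_jj) / 2)
    bracket = (math.cosh(mu_i + mu_j) * math.exp(v_ij) / 2
               - math.cosh(mu_i - mu_j) * math.exp(-v_ij) / 2
               - math.sinh(mu_i) * math.sinh(mu_j))
    return scale * bracket


if __name__ == "__main__":
    # The covariance formula with i = j must reproduce the variance formula.
    print(sinh_normal_var(0.3, 0.5), sinh_normal_cov(0.3, 0.3, 0.5, 0.5, 0.5))
    # Uncorrelated normal components give uncorrelated sinh-transformed components.
    print(sinh_normal_cov(0.3, -0.2, 0.5, 0.7, 0.0))
```

Two consistency checks are built in: the covariance expression collapses to the variance expression when $i = j$, and it vanishes when $v_{ij} = 0$. The inverse transformation from $Y$ back to $X$ is available in Python as `math.asinh`, which equals $\ln(y + \sqrt{y^2 + 1})$.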

In the following, for any p x p symmetric matrix A — (a^j), let uvec A denote 
the p(p + l)/2 dimensional column vector (an, U 22 , • • • , a^p, an, ais, . . . , ap-i^pY 
formed from the elements in the upper triangular half of including the diagonal 
elements. 

We consider the functions

\[ f : \mathbb{R}^{p+p(p+1)/2} \to \mathbb{R}^{p}, \qquad g : \mathbb{R}^{p+p(p+1)/2} \to \mathbb{R}^{p(p+1)/2} \]

defined respectively by

\[ f\begin{pmatrix} x \\ \mathrm{uvec}\,X \end{pmatrix} = y, \qquad g\begin{pmatrix} x \\ \mathrm{uvec}\,X \end{pmatrix} = \mathrm{uvec}\,Y \]

with

\[ y_i = e^{x_{ii}/2} \sinh(x_i), \tag{2.4} \]

\[ y_{ij} = e^{(x_{ii}+x_{jj})/2} \left[ \frac{e^{x_{ij}} \cosh(x_i + x_j) - e^{-x_{ij}} \cosh(x_i - x_j)}{2} - \sinh(x_i) \sinh(x_j) \right] \tag{2.5} \]

for all $i, j = 1, \ldots, p$, where $x = (x_1, \ldots, x_p)^t$, $y = (y_1, \ldots, y_p)^t$ and $X = (x_{ij})$,
$Y = (y_{ij})$ are $p \times p$ symmetric matrices. Note that these functions are defined in
accordance with the relationship between the parameters of both distributions.
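As a rough illustration of the maps $f$ and $g$, the following Python sketch evaluates the moments of $Y$ from normal parameters, using $m_i = e^{v_{ii}/2}\sinh(\mu_i)$ and the covariance expression built from $\cosh(\mu_i \pm \mu_j)$. The function name `sinh_normal_moments` and the list-of-lists matrix layout are assumptions made for this example; it returns the full symmetric matrix rather than its uvec.

```python
import math


def sinh_normal_moments(mu, V):
    """Map normal parameters (mu, V) to the mean m and covariance S of sinh(X)."""
    p = len(mu)
    # Mean: m_i = exp(v_ii / 2) * sinh(mu_i).
    m = [math.exp(V[i][i] / 2.0) * math.sinh(mu[i]) for i in range(p)]
    S = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            scale = math.exp((V[i][i] + V[j][j]) / 2.0)
            plus = math.exp(V[i][j]) * math.cosh(mu[i] + mu[j])
            minus = math.exp(-V[i][j]) * math.cosh(mu[i] - mu[j])
            S[i][j] = scale * ((plus - minus) / 2.0
                               - math.sinh(mu[i]) * math.sinh(mu[j]))
    return m, S


if __name__ == "__main__":
    m, S = sinh_normal_moments([0.3, -0.5], [[2.0, 0.4], [0.4, 1.0]])
    print(m)
    print(S)
```

For $i = j$ the covariance expression reduces to the variance formula $(e^{v_{ii}} - 1)[1 + \cosh(2\mu_i) e^{v_{ii}}]/2$, which gives a quick consistency check.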

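Let $Y_1, \ldots, Y_n$ be a sample drawn from $F'$, which is supposed to be the
$p$-variate $\sinh^{-1}$-normal distribution with mean $m = (m_1, \ldots, m_p)^t$ and covariance
matrix $S = (s_{ij})$. We define $m_n(Y_1, \ldots, Y_n)$ and $W_n(Y_1, \ldots, Y_n)$, the
estimators of the mean $m$ and of the covariance matrix $S$, respectively by

\[ m_n(Y_1, \ldots, Y_n) = f\begin{pmatrix} t_n(X_1, \ldots, X_n) \\ \mathrm{uvec}\,C_n(X_1, \ldots, X_n) \end{pmatrix}, \tag{2.6} \]

\[ \mathrm{uvec}\,W_n(Y_1, \ldots, Y_n) = g\begin{pmatrix} t_n(X_1, \ldots, X_n) \\ \mathrm{uvec}\,C_n(X_1, \ldots, X_n) \end{pmatrix}, \tag{2.7} \]

where $X_k^l = \ln\left( Y_k^l + \sqrt{(Y_k^l)^2 + 1} \right)$ for all $k = 1, \ldots, n$ and $l = 1, \ldots, p$,
$X_k^l$, $Y_k^l$ being the components of the random vectors $X_k$ and $Y_k$, respectively.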
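The construction of the estimators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plug_in_estimates` is hypothetical, and the sample mean and the covariance with divisor $n$ are used as a non-robust stand-in for $(t_n, C_n)$; a robust analysis would replace them by M- or S-estimates.

```python
import math


def plug_in_estimates(Y):
    """Estimate the mean and covariance of sinh-normal data Y (a list of rows)."""
    n, p = len(Y), len(Y[0])
    # Back-transform each component: asinh(y) = ln(y + sqrt(y^2 + 1)).
    X = [[math.asinh(y) for y in row] for row in Y]
    # Stand-in location and scatter: sample mean and covariance (divisor n).
    t = [sum(row[l] for row in X) / n for l in range(p)]
    C = [[sum((row[i] - t[i]) * (row[j] - t[j]) for row in X) / n
          for j in range(p)] for i in range(p)]
    # Plug (t, C) into the moment relations of the sinh-normal distribution.
    m = [math.exp(C[i][i] / 2) * math.sinh(t[i]) for i in range(p)]
    W = [[math.exp((C[i][i] + C[j][j]) / 2)
          * ((math.exp(C[i][j]) * math.cosh(t[i] + t[j])
              - math.exp(-C[i][j]) * math.cosh(t[i] - t[j])) / 2
             - math.sinh(t[i]) * math.sinh(t[j]))
          for j in range(p)] for i in range(p)]
    return m, W


if __name__ == "__main__":
    sample = [[math.sinh(-1.0)], [0.0], [math.sinh(1.0)]]
    print(plug_in_estimates(sample))
```

Because the back-transformation acts componentwise, any pair of location and scatter estimators for multivariate normal data can be substituted for the stand-in without changing the last step.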

The problem of robust estimation for the parameters of the multivariate 
normal distribution has been widely studied. We give here two examples: the M- 
estimators (see Maronna, 1976) and the S-estimators (see Davies, 1987; Rousseeuw 
and Leroy, 1987). 

In the following, the notation $\|x\|_A^2$ will be used for $x^t A x$, where $x \in \mathbb{R}^p$
and $A$ is a symmetric positive definite matrix. Consider a sample of $p$-dimensional
observations $X_1, \ldots, X_n$. The M-estimators are implicitly defined by

\[ t_n = \frac{\sum_{i=1}^n w_1\left(\|X_i - t_n\|_{C_n^{-1}}\right) X_i}{\sum_{i=1}^n w_1\left(\|X_i - t_n\|_{C_n^{-1}}\right)}, \]

\[ C_n = \frac{1}{n} \sum_{i=1}^n w_2\left(\|X_i - t_n\|^2_{C_n^{-1}}\right) (X_i - t_n)(X_i - t_n)^t, \]

where $w_1$ and $w_2$ are specified weight functions.

The S-estimators are defined as the solutions $(t_n, C_n)$ to the problem of
minimizing $\det(C)$ subject to

\[ \frac{1}{n} \sum_{i=1}^n \rho\left(\|X_i - t\|_{C^{-1}}\right) = k_p \]

among all $(t, C) \in \mathbb{R}^p \times \mathrm{SPD}(p)$. The function $\rho$ is nondecreasing, $k_p = E_{F_0}\,\rho(\|x\|)$,
$F_0 = N_p(0, I_p)$. The S-estimators have excellent robustness properties. Moreover,
they are fast to compute; for this purpose the SURREAL algorithm of Ruppert (1992)
is recommended. Theoretical results favor the use of the S-estimators since they
combine high efficiency with appealing robustness properties, including a smooth
influence function.
Remark 2.1. The statistical functionals corresponding to the estimators $m_n$ and
$W_n$ are given by

\[ m(G) = f\begin{pmatrix} t(G \circ u) \\ \mathrm{uvec}\,C(G \circ u) \end{pmatrix}, \qquad \mathrm{uvec}\,W(G) = g\begin{pmatrix} t(G \circ u) \\ \mathrm{uvec}\,C(G \circ u) \end{pmatrix}, \]

where $u$ is the function $u(x_1, \ldots, x_p) = [\sinh(x_1), \ldots, \sinh(x_p)]$.



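Theorem 2.2. The estimators $m_n$ and $W_n$ have the properties

\[ m_n \to m(F') \quad \text{and} \quad W_n \to W(F'), \]

where

\[ m(F') = f\begin{pmatrix} t(F' \circ u) \\ \mathrm{uvec}\,C(F' \circ u) \end{pmatrix}, \qquad \mathrm{uvec}\,W(F') = g\begin{pmatrix} t(F' \circ u) \\ \mathrm{uvec}\,C(F' \circ u) \end{pmatrix}, \tag{2.8} \]

where $u$ is the function from Remark 2.1.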



Proof. First let us observe that

\[ F = F' \circ u \tag{2.9} \]

where $u$ is the function $u(x_1, \ldots, x_p) = [\sinh(x_1), \ldots, \sinh(x_p)]$.

From the consistency of the estimators $t_n$ and $C_n$ we obtain the consistency
of $(t_n^t, (\mathrm{uvec}\,C_n)^t)^t$, and then the continuity of the functions $f$ and $g$ implies the
consistency of $m_n$ and $W_n$, respectively. The asymptotic values are obtained
using (2.9). □

Let us note that the Fisher consistency of the functionals $m$ and $W$
follows from the Fisher consistency of $t$ and $C$, on the basis of the relations (2.1)
and (2.3).



3. Influence Functions 

In this section we will calculate the influence functions of the functionals $m$ and $W$
and will prove the B-robustness of these functionals. In particular, we will consider
the case when the functionals $t$ and $C$ are affine equivariant.

We recall that for any affine equivariant location functional $t$ possessing an
influence function, there exists a function $\gamma_t$ defined on $[0, \infty[$ and $\mathbb{R}$-valued such
that

\[ IF(z; t, F) = \gamma_t\left(\|z - \mu\|_{V^{-1}}\right)(z - \mu) \tag{3.1} \]

where $F$ is the distribution $N_p(\mu, V)$ (see Toma, 2003).

In Lemma 1 from Croux and Haesbroeck (2000), we see that for any affine
equivariant scatter functional $C$ possessing an influence function, there exist two
functions $\alpha_C$ and $\beta_C$ defined on $[0, \infty[$ and $\mathbb{R}$-valued such that

\[ IF(z; C, F) = \alpha_C\left(\|z - \mu\|_{V^{-1}}\right)(z - \mu)(z - \mu)^t - \beta_C\left(\|z - \mu\|_{V^{-1}}\right) V \tag{3.2} \]

where $F$ is the distribution $N_p(\mu, V)$.
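Theorem 3.1. Suppose that $t$ and $C$ are Fisher consistent functionals. The
influence function $IF(x; m, F')$ has the components

\[ IF(x; m, F')_i = e^{v_{ii}/2} \left( \cosh(\mu_i)\, IF(z; t, F)_i + \tfrac{1}{2} \sinh(\mu_i)\, IF(z; C, F)_{ii} \right) \tag{3.3} \]

where $z = (z_1, \ldots, z_p)^t$, $z_i = \ln\left(x_i + \sqrt{x_i^2 + 1}\right)$ for all $i = 1, \ldots, p$.

Proof. First we note that $F'_{\varepsilon,x} \circ u = F_{\varepsilon,z}$, where

\[ z = (z_1, \ldots, z_p)^t, \qquad z_i = \ln\left(x_i + \sqrt{x_i^2 + 1}\right) \text{ for all } i = 1, \ldots, p, \]

and $u$ is the function from Remark 2.1. Next, using the definition of the influence
function, we have

\[ IF(x; m, F') = \lim_{\varepsilon \to 0} \frac{m(F'_{\varepsilon,x}) - m(F')}{\varepsilon}
 = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \left[ f\begin{pmatrix} t(F_{\varepsilon,z}) \\ \mathrm{uvec}\,C(F_{\varepsilon,z}) \end{pmatrix} - f\begin{pmatrix} t(F) \\ \mathrm{uvec}\,C(F) \end{pmatrix} \right]. \]

Due to the Fisher consistency of $t$ and $C$ we obtain

\begin{align*}
IF(x; m, F')_i &= \lim_{\varepsilon \to 0} \frac{m(F'_{\varepsilon,x})_i - m(F')_i}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \left\{ \frac{e^{\mu_i + v_{ii}/2}}{2} \left[ e^{t(F_{\varepsilon,z})_i - t(F)_i + [C(F_{\varepsilon,z})_{ii} - C(F)_{ii}]/2} - 1 \right] \right. \\
&\qquad\qquad \left. - \frac{e^{-\mu_i + v_{ii}/2}}{2} \left[ e^{-t(F_{\varepsilon,z})_i + t(F)_i + [C(F_{\varepsilon,z})_{ii} - C(F)_{ii}]/2} - 1 \right] \right\} \\
&= \frac{e^{\mu_i + v_{ii}/2}}{2} \left[ IF(z; t, F)_i + \tfrac{1}{2} IF(z; C, F)_{ii} \right]
 - \frac{e^{-\mu_i + v_{ii}/2}}{2} \left[ -IF(z; t, F)_i + \tfrac{1}{2} IF(z; C, F)_{ii} \right] \\
&= e^{v_{ii}/2} \left( \cosh(\mu_i)\, IF(z; t, F)_i + \tfrac{1}{2} \sinh(\mu_i)\, IF(z; C, F)_{ii} \right),
\end{align*}

for every $i = 1, \ldots, p$, and this completes the proof. □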



Corollary 3.2. Suppose that $t$ and $C$ are Fisher consistent and B-robust. Then the
functional $m$ is B-robust.

The influence function can also be helpful for computing asymptotic variances
(cf. Hampel et al., 1986, page 226). From this point of view it is useful to consider
the particular case when $t$ and $C$ are affine equivariant.

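Corollary 3.3. If $t$ and $C$ are affine equivariant, then

\[ IF(x; m, F')_i = e^{v_{ii}/2} \left\{ \cosh(\mu_i)\, \gamma_t\left(\|z - \mu\|_{V^{-1}}\right)(z_i - \mu_i)
 + \tfrac{1}{2} \sinh(\mu_i) \left[ \alpha_C\left(\|z - \mu\|_{V^{-1}}\right)(z_i - \mu_i)^2 - \beta_C\left(\|z - \mu\|_{V^{-1}}\right) v_{ii} \right] \right\} \tag{3.4} \]

for all $i = 1, \ldots, p$.

Figure 1 and Figure 2 represent the norm of the influence function of the mean
functional $m$ for the particular case in which $F = N_2(0, \mathrm{diag}(2, 1))$. For $t_n$ and $C_n$
both classical maximum likelihood estimators and S-estimators are considered.

In the following, $\mathrm{uvec}\,W$ denotes the statistical functional defined by

\[ (\mathrm{uvec}\,W)(G) = \mathrm{uvec}\,W(G). \tag{3.5} \]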

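Theorem 3.4. Suppose that $t$ and $C$ are Fisher consistent. The influence function
of $W$ in $F'$ is given by

\[ \mathrm{uvec}\,IF(x; W, F') = IF(x; \mathrm{uvec}\,W, F') \]

where the $ij$ component of $IF(x; W, F')$ is

\begin{align*}
IF(x; W, F')_{ij} = \frac{e^{(v_{ii}+v_{jj})/2}}{2} \Big\{
 & \left[ e^{v_{ij}} \sinh(\mu_i + \mu_j) - e^{-v_{ij}} \sinh(\mu_i - \mu_j) - 2\cosh(\mu_i)\sinh(\mu_j) \right] IF(z; t, F)_i \\
 + & \left[ e^{v_{ij}} \sinh(\mu_i + \mu_j) + e^{-v_{ij}} \sinh(\mu_i - \mu_j) - 2\sinh(\mu_i)\cosh(\mu_j) \right] IF(z; t, F)_j \\
 + & \left[ e^{v_{ij}} \cosh(\mu_i + \mu_j) - e^{-v_{ij}} \cosh(\mu_i - \mu_j) - 2\sinh(\mu_i)\sinh(\mu_j) \right] \left[ IF(z; C, F)_{ii} + IF(z; C, F)_{jj} \right]/2 \\
 + & \left[ e^{v_{ij}} \cosh(\mu_i + \mu_j) + e^{-v_{ij}} \cosh(\mu_i - \mu_j) \right] IF(z; C, F)_{ij} \Big\}, \tag{3.6}
\end{align*}

for $i, j = 1, \ldots, p$, with $z = (z_1, \ldots, z_p)^t$, $z_i = \ln\left(x_i + \sqrt{x_i^2 + 1}\right)$ for $i = 1, \ldots, p$.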



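Proof. As in Theorem 3.1, we use the fact that $F'_{\varepsilon,x} \circ u = F_{\varepsilon,z}$, where $z = (z_1, \ldots, z_p)^t$,
$z_i = \ln\left(x_i + \sqrt{x_i^2 + 1}\right)$ for all $i = 1, \ldots, p$ and $u$ is the function from Remark 2.1.
Then

\[ IF(x; \mathrm{uvec}\,W, F') = \lim_{\varepsilon \to 0} \frac{\mathrm{uvec}\,W(F'_{\varepsilon,x}) - \mathrm{uvec}\,W(F')}{\varepsilon}
 = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \left[ g\begin{pmatrix} t(F_{\varepsilon,z}) \\ \mathrm{uvec}\,C(F_{\varepsilon,z}) \end{pmatrix} - g\begin{pmatrix} t(F) \\ \mathrm{uvec}\,C(F) \end{pmatrix} \right]. \]

Using the Fisher consistency of $t$ and $C$ and following a calculation analogous
to that in Theorem 3.1, we obtain the formula (3.6). □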
Figure 1. Norm of the influence function of the mean for the classical
MLE at $F = N_2(0, \mathrm{diag}(2, 1))$.

Figure 2. Norm of the influence function of the mean for
S-estimators at $F = N_2(0, \mathrm{diag}(2, 1))$.



We note that in the particular case when $i = j$, formula (3.6) simplifies to

\[ IF(x; W, F')_{ii} = e^{v_{ii}} \left(e^{v_{ii}} - 1\right) \sinh(2\mu_i)\, IF(z; t, F)_i
 + \frac{e^{v_{ii}}}{2} \left( 1 - \cosh(2\mu_i) + 2 e^{v_{ii}} \cosh(2\mu_i) \right) IF(z; C, F)_{ii}. \tag{3.7} \]
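The two coefficients in (3.7) are the partial derivatives of the variance $s_{ii} = (e^{v_{ii}} - 1)[1 + \cosh(2\mu_i) e^{v_{ii}}]/2$ with respect to $\mu_i$ and $v_{ii}$. The short Python sketch below compares them with central finite differences; the function names and the evaluation point are assumptions chosen for illustration only.

```python
import math


def s_ii(mu, v):
    """Variance of sinh(X) for X ~ N(mu, v)."""
    return (math.exp(v) - 1) * (1 + math.exp(v) * math.cosh(2 * mu)) / 2


def coef_location(mu, v):
    """Coefficient of IF(z; t, F)_i in (3.7)."""
    return math.exp(v) * (math.exp(v) - 1) * math.sinh(2 * mu)


def coef_scatter(mu, v):
    """Coefficient of IF(z; C, F)_ii in (3.7)."""
    return math.exp(v) / 2 * (1 - math.cosh(2 * mu)
                              + 2 * math.exp(v) * math.cosh(2 * mu))


def central_diff(fun, x, h=1e-6):
    """Central finite-difference approximation of the derivative of fun at x."""
    return (fun(x + h) - fun(x - h)) / (2 * h)


if __name__ == "__main__":
    mu, v = 0.4, 1.3
    print(central_diff(lambda a: s_ii(a, v), mu), coef_location(mu, v))
    print(central_diff(lambda b: s_ii(mu, b), v), coef_scatter(mu, v))
```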

Corollary 3.5. Suppose that $t$ and $C$ are Fisher consistent and B-robust. Then the
functional $W$ is B-robust.

Particularly, when t and C are affine equivariant, by replacing (3.1) and (3.2) 
in (3.6) we obtain the ij component of IF (x; W, F'). 
4. Asymptotic Results

In this section we prove the asymptotic normality of the estimators $m_n$ and $W_n$
and give a result regarding the asymptotic relative efficiency of the couple estimator
$(m_n^t, (\mathrm{uvec}\,W_n)^t)^t$. The asymptotic normality is obtained through the following
lemma, whose proof can be found for example in Mardia et al. (1994).

Lemma 4.1. Let $Z_1, Z_2, \ldots$ be a sequence of random vectors valued in $\mathbb{R}^K$,
asymptotically normal in the sense that

\[ \sqrt{n}\,(Z_n - a) \xrightarrow{d} N_K(0, V) \]

for $a \in \mathbb{R}^K$ and $V$ a $K \times K$ symmetric and positive definite matrix. Let also $h$ be
a map defined on $\mathbb{R}^K$ and valued in $\mathbb{R}^L$ which is differentiable at the point $a$.
Then $h(Z_1), h(Z_2), \ldots$ is asymptotically normal. More precisely,

\[ \sqrt{n}\,(h(Z_n) - h(a)) \xrightarrow{d} N_L\left(0, Dh(a)\, V\, (Dh(a))^t\right) \]

where $Dh(a)$ is the differential of $h$ at $a$.
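The covariance matrix in the lemma is straightforward to evaluate numerically. The following Python sketch approximates $Dh(a)$ by central finite differences and forms $Dh(a)\,V\,(Dh(a))^t$; the function names, the step size and the use of a numerical rather than analytic Jacobian are assumptions made for this example.

```python
def numerical_jacobian(h, a, eps=1e-6):
    """Approximate the L x K Jacobian of h at a by central differences."""
    out_dim = len(h(a))
    jac = [[0.0] * len(a) for _ in range(out_dim)]
    for c in range(len(a)):
        up, down = list(a), list(a)
        up[c] += eps
        down[c] -= eps
        h_up, h_down = h(up), h(down)
        for r in range(out_dim):
            jac[r][c] = (h_up[r] - h_down[r]) / (2 * eps)
    return jac


def delta_method_cov(h, a, V, eps=1e-6):
    """Return Dh(a) V Dh(a)^t, the asymptotic covariance of h(Z_n)."""
    D = numerical_jacobian(h, a, eps)
    K, L = len(a), len(D)
    DV = [[sum(D[r][k] * V[k][c] for k in range(K)) for c in range(K)]
          for r in range(L)]
    return [[sum(DV[r][k] * D[c][k] for k in range(K)) for c in range(L)]
            for r in range(L)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(delta_method_cov(lambda x: [x[0] * x[1], x[0]], [2.0, 3.0], identity))
```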



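Theorem 4.2. Suppose that $\sqrt{n}\,(t_n - \mu)$ has a limiting normal distribution with
zero mean and asymptotic covariance matrix $ASV(t, F)$ and $\sqrt{n}\,\mathrm{uvec}(C_n - V)$ has
a limiting normal distribution with zero mean and asymptotic covariance matrix
$ASV(\mathrm{uvec}\,C, F)$. If $t_n$ and $C_n$ are asymptotically independent, then $\sqrt{n}\,(m_n - m)$
and $\sqrt{n}\,\mathrm{uvec}(W_n - S)$ have limiting normal distributions with zero means and
asymptotic covariance matrices

\[ ASV(m, F') = Df\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix}
 \begin{pmatrix} ASV(t, F) & 0 \\ 0 & ASV(\mathrm{uvec}\,C, F) \end{pmatrix}
 \left[ Df\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix} \right]^t \tag{4.1} \]

and

\[ ASV(\mathrm{uvec}\,W, F') = Dg\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix}
 \begin{pmatrix} ASV(t, F) & 0 \\ 0 & ASV(\mathrm{uvec}\,C, F) \end{pmatrix}
 \left[ Dg\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix} \right]^t. \tag{4.2} \]

Proof. By the hypotheses we obtain that $(t_n^t, (\mathrm{uvec}\,C_n)^t)^t$ is asymptotically
normal with mean $(\mu^t, (\mathrm{uvec}\,V)^t)^t$ and asymptotic covariance matrix

\[ \begin{pmatrix} ASV(t, F) & 0 \\ 0 & ASV(\mathrm{uvec}\,C, F) \end{pmatrix}. \]

Next we apply Lemma 4.1 to $(t_n^t, (\mathrm{uvec}\,C_n)^t)^t$ with $h = f$ and find that $m_n$ is
asymptotically normal with mean $m$ and asymptotic covariance matrix given
by (4.1).

Likewise, using $h = g$ we obtain the asymptotic normality of $\mathrm{uvec}\,W_n$. □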
As is well known, the asymptotic relative efficiency of the estimator $T_n$ of $T(G) = \theta \in \mathbb{R}^k$
can be defined, whenever it exists, as

\[ \mathrm{Eff}(T, G) = \left( \frac{\det\{ASV(T_{cl}, G)\}}{\det\{ASV(T, G)\}} \right)^{1/k} \tag{4.3} \]

where $ASV(T_{cl}, G)$ is the asymptotic covariance matrix corresponding to the
maximum likelihood estimator (MLE) of $\theta$ and $ASV(T, G)$ is the asymptotic covariance
matrix corresponding to $T_n$.

In the following we suppose that the conditions of Theorem 4.2 are fulfilled.
We denote by $t_{cl}$, $\mathrm{uvec}\,C_{cl}$ and $U_{cl}$ the statistical functionals corresponding to
the MLE of $\mu$, $\mathrm{uvec}\,V$ and $(m^t, (\mathrm{uvec}\,S)^t)^t$, respectively; $U$ will be the statistical
functional corresponding to the couple estimator $(m_n^t, (\mathrm{uvec}\,W_n)^t)^t$.

Theorem 4.3. The asymptotic relative efficiency of $(m_n^t, (\mathrm{uvec}\,W_n)^t)^t$ is

\[ \mathrm{Eff}(U, F') = [\mathrm{Eff}(t, F)]^{2/(p+3)}\, [\mathrm{Eff}(\mathrm{uvec}\,C, F)]^{(p+1)/(p+3)}, \tag{4.4} \]

where $\mathrm{Eff}(t, F)$ and $\mathrm{Eff}(\mathrm{uvec}\,C, F)$ are the asymptotic relative efficiencies for $t_n$
and $\mathrm{uvec}\,C_n$, respectively.
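Formula (4.4) is a weighted geometric mean whose exponents $2/(p+3)$ and $(p+1)/(p+3)$ sum to one. A minimal Python sketch follows; the function name `couple_efficiency` and the numerical inputs are hypothetical and serve only to show how the weights shift toward the scatter part as $p$ grows.

```python
def couple_efficiency(eff_t, eff_c, p):
    """Combine location and scatter efficiencies as in (4.4)."""
    if p < 1:
        raise ValueError("dimension p must be at least 1")
    weight_t = 2.0 / (p + 3)           # exponent of Eff(t, F)
    weight_c = (p + 1.0) / (p + 3)     # exponent of Eff(uvec C, F)
    return eff_t ** weight_t * eff_c ** weight_c


if __name__ == "__main__":
    for p in (1, 2, 5, 10):
        print(p, couple_efficiency(0.95, 0.80, p))
```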



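Proof. The MLE for $(m^t, (\mathrm{uvec}\,S)^t)^t$ is obtained from the MLE of $(\mu^t, (\mathrm{uvec}\,V)^t)^t$
through the transformation $h$ defined on $\mathbb{R}^{p+p(p+1)/2}$ and valued in $\mathbb{R}^{p+p(p+1)/2}$,

\[ h(x) = (f(x)^t, g(x)^t)^t, \]

because $h$ is one to one.

Due to the fact that the maximum likelihood estimators for the parameters
of the multivariate normal distribution are asymptotically independent, the
asymptotic covariance matrix corresponding to the MLE of $(m^t, (\mathrm{uvec}\,S)^t)^t$ is

\[ ASV(U_{cl}, F') = Dh\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix}
 \begin{pmatrix} ASV(t_{cl}, F) & 0 \\ 0 & ASV(\mathrm{uvec}\,C_{cl}, F) \end{pmatrix}
 \left[ Dh\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix} \right]^t, \tag{4.5} \]

$ASV(t_{cl}, F)$ and $ASV(\mathrm{uvec}\,C_{cl}, F)$ being the asymptotic covariance matrices
corresponding to the maximum likelihood estimators of $\mu$ and $\mathrm{uvec}\,V$, respectively.
On the other hand, the asymptotic covariance matrix corresponding to
$(m_n^t, (\mathrm{uvec}\,W_n)^t)^t$ is

\[ ASV(U, F') = Dh\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix}
 \begin{pmatrix} ASV(t, F) & 0 \\ 0 & ASV(\mathrm{uvec}\,C, F) \end{pmatrix}
 \left[ Dh\begin{pmatrix} \mu \\ \mathrm{uvec}\,V \end{pmatrix} \right]^t. \tag{4.6} \]

Using (4.3), we obtain

\begin{align*}
\mathrm{Eff}(U, F') &= \left( \frac{\det\{ASV(U_{cl}, F')\}}{\det\{ASV(U, F')\}} \right)^{1/(p+p(p+1)/2)} \\
&= \left( \frac{\det\{ASV(t_{cl}, F)\}\, \det\{ASV(\mathrm{uvec}\,C_{cl}, F)\}}{\det\{ASV(t, F)\}\, \det\{ASV(\mathrm{uvec}\,C, F)\}} \right)^{1/(p+p(p+1)/2)} \\
&= [\mathrm{Eff}(t, F)]^{2/(p+3)}\, [\mathrm{Eff}(\mathrm{uvec}\,C, F)]^{(p+1)/(p+3)},
\end{align*}

and this completes the proof. □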



5. Conclusions 

In this paper we introduced estimators for the mean and covariance matrix of the
multivariate $\sinh^{-1}$-normal distribution. We proved the Fisher consistency of the
corresponding functionals and derived the influence functions, which appear to be
bounded. We also obtained the corresponding asymptotic covariance matrices and
the asymptotic relative efficiency of the couple estimator.
A Robust Estimator of the Tail Index
Based on an Exponential Regression Model

B. Vandewalle, J. Beirlant and M. Hubert 

Abstract. The objectives of a robust statistical analysis and of an extreme 
value analysis apparently are contradictory. Where the extreme data are 
downweighted in robust statistics, these observations receive most attention 
in an extreme value approach. The most prominent extreme value methods 
however are constructed on maximum likelihood estimates based on specific 
parametric models which are fitted to exceedances over large thresholds. So 
within an extreme value framework some robust algorithms replacing the max- 
imum likelihood part of this methodology can be of use leading to estimates 
which are less sensitive to few particular observations. This study is motivated 
by a soil database quality management project, where in the background of 
Pareto-type tails, automatic identification of suspicious data is needed.

Mathematics Subject Classification (2000). Primary 62G35; Secondary 62G32. 
Keywords. Extreme value statistics, robustness. 



1. Introduction 

In agriculture, a new concept of crop management has emerged, permitting within- 
field variation of crop techniques as, for instance, the adjustment of fertilizer inputs 
on the basis of soil sampling and analysis. The development of these techniques 
has greatly increased the demand for soil data, and laboratories are burdened
with large data sets, which inevitably cause concern about outliers and the quality
of the information. Therefore, automatic outlier detection methods have become
a necessity in database management in order to provide high quality data
to laboratories. The present paper studies Ca and pH records from the Condroz
region in Belgium (1505 observations, Goegebeur et al., 2002). The Ca distribution,
conditional on pH-level, appears to be right-skewed and long-tailed, resulting in 
rather frequent large Ca measurements, as can be seen in the scatterplot of Ca 
versus pH (given in Figure 1). Robust statistical procedures which assume that the 
regular data points are sampled from a normal distribution will flag too many large 
observations as outliers. Such long-tailed data can be analyzed more efficiently in
the framework of extreme value theory. We will present a method which, in the
context of a model with heavy tails, allows us to point out potential outliers which
need to be investigated before further analysis can be done.

Figure 1. Scatterplot of Ca versus pH for one of the communities
of the Condroz region in Belgium.



2. Extreme Value Statistics 



In extreme value statistics the extreme value index (or tail index), denoted by $\gamma$, is
used to characterize the tail behavior of a distribution. This real-valued parameter
helps to indicate the size and frequency of certain extreme events under a given
probability distribution: the larger $\gamma$, the heavier the tail.

Consider $X_1, \ldots, X_n$ independent and identically distributed (i.i.d.) random
variables with common cumulative distribution function $F$ and quantile function
$Q$. Denote the corresponding order statistics by $X_{1,n} \le \cdots \le X_{n,n}$ and suppose
that the properly centered and normed sample maxima $X_{n,n} = \max\{X_1, \ldots, X_n\}$
converge in distribution, for $n \to \infty$, to a non-degenerate limit. This limit
distribution is necessarily of extreme value type. Indeed, sequences of constants $a_n > 0$
and $b_n$ can then be found such that

\[ \lim_{n \to \infty} P\left( \frac{X_{n,n} - b_n}{a_n} \le x \right) = H_\gamma(x) \tag{2.1} \]

with

\[ H_\gamma(x) = \begin{cases} \exp\left( -(1 + \gamma x)^{-1/\gamma} \right), & 1 + \gamma x > 0,\ \gamma \ne 0, \\ \exp\left( -\exp(-x) \right), & x \in \mathbb{R},\ \gamma = 0. \end{cases} \tag{2.2} \]

If (2.1) is satisfied, then $F$ is said to belong to the maximum domain of attraction
of $H_\gamma$, denoted as $F \in \mathcal{D}(H_\gamma)$.
Distributions $F$ for which $\gamma > 0$ are called Pareto-type (or heavy-tailed)
distributions, i.e., $\bar F(x) = 1 - F(x) = x^{-1/\gamma} l_F(x)$ with $l_F$ a slowly varying function. The
Gumbel class of distributions with $\gamma = 0$ is a quite extensive class of distributions,
mainly with exponentially decreasing tails. The Weibull class, with $\gamma < 0$, consists of
distributions with a finite right endpoint $x_+$ for which $\bar F(x_+ - 1/x) = x^{1/\gamma} l_F(x)$,
with $l_F$ again a slowly varying function. A general reference on extreme value
statistics is Embrechts et al. (1998).

We will concentrate on Pareto-type distributions, i.e., distributions for which
there exists a $\gamma > 0$ such that $\bar F(x) = x^{-1/\gamma} l_F(x)$ or, equivalently, $U(x) = x^{\gamma} l_U(x)$ with $l_F$, $l_U$
slowly varying functions and $U(x) = Q(1 - 1/x)$ with $Q$ the quantile function of
$F$. As for slowly varying functions $l_F(tx)/l_F(t) \to 1$ for all $x > 0$ and $t \to \infty$, the
conditional distribution of relative excesses $P(X/t > x \mid X > t)$ converges to $x^{-1/\gamma}$
for all $x > 1$.

A graphical tool for checking Pareto-type behavior is the Pareto quantile
plot. As log-transformed Pareto distributed random variables are exponentially
distributed, the hypothesis of strict Pareto behavior can be verified by looking
at an exponential quantile plot based on the log-transformed data, leading to a
Pareto quantile plot

\[ \left( \log\frac{n+1}{j},\ \log x_{n-j+1,n} \right), \quad j = 1, \ldots, n. \tag{2.3} \]

For Pareto-type distributions, since $U(x) = x^{\gamma} l_U(x)$, it follows that

\[ \log U(x) = \gamma \log x + \log l_U(x). \tag{2.4} \]

Since $\log l_U(x)/\log x \to 0$ as $x \to \infty$, we have that $\log U(x) \sim \gamma \log x$ as $x \to \infty$,
which implies that for Pareto-type distributed data, the Pareto quantile plot,
ultimately for the smaller $j$-values, shows a linear behavior, with slope $\gamma$.

In Figure 2(a), the Pareto quantile plot is shown for the variable Ca for one 
of the communities in the Condroz database. The last seven observations that 
do not follow the ultimate linearity of the rest of the Pareto quantile plot can 
be considered as outliers with respect to the Pareto model. As can be seen from 
the Ca versus pH scatterplot given in Figure 1, extreme Ca measurements tend to 
occur more often at higher pH levels, justifying a tail analysis conditional on higher 
pH levels. In Figure 2(b), the Pareto quantile plot is shown for the variable Ca, 
conditional on pH-levels from 7 up to 7.5 (428 observations). Now, six observations
can be indicated as possible outliers with respect to the Pareto model. We will use 
the data set with pH-levels from 7 up to 7.5 in the sequel. 

Figure 2. Pareto quantile plot of Ca for the Condroz data (a)
not taking into account pH and (b) conditional on pH-levels from
7 up to 7.5.

The problem of estimation of the extreme value index, and of extreme
quantiles and small exceedance probabilities, in case the distribution is of Pareto type,
has been studied in great detail in the recent literature. Hill (1975) introduced the
estimator

\[ \hat\gamma_{k,H} = \frac{1}{k} \sum_{j=1}^{k} \log X_{n-j+1,n} - \log X_{n-k,n} \tag{2.5} \]

or, equivalently,

\[ \hat\gamma_{k,H} = \frac{1}{k} \sum_{j=1}^{k} j \left( \log X_{n-j+1,n} - \log X_{n-j,n} \right). \tag{2.6} \]

This estimator has received much attention in the literature. As the Hill estimator
measures the average increase of the Pareto quantile plot above an anchor point
$\left( \log\frac{n+1}{k+1},\ \log x_{n-k,n} \right)$, it can be interpreted as a slope estimator of the linear part
of the Pareto quantile plot. More recently several authors have recognized and
exploited the potential of quantile plots in estimating $\gamma > 0$ (Beirlant et al., 1996,
Schultze and Steinebach, 1996, and Kratz and Resnick, 1996).
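The Hill estimator can be computed either as the mean of the $k$ largest log-observations minus the log of the anchor order statistic, or as the mean of the log-spacings weighted by their index $j$. The Python sketch below implements both forms so that their equality can be checked on data; the function names are assumptions made for this example, and strictly positive observations are required.

```python
import math


def hill_estimator(data, k):
    """Mean of the k largest log-observations minus log X_{n-k,n}."""
    x = sorted(data)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if x[n - k - 1] <= 0:
        raise ValueError("the k + 1 largest observations must be positive")
    anchor = math.log(x[n - k - 1])  # log X_{n-k,n} (0-based index n-k-1)
    top = sum(math.log(x[n - j]) for j in range(1, k + 1))
    return top / k - anchor


def hill_from_spacings(data, k):
    """Mean of the log-spacings j * (log X_{n-j+1,n} - log X_{n-j,n})."""
    x = sorted(data)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    spacings = [j * (math.log(x[n - j]) - math.log(x[n - j - 1]))
                for j in range(1, k + 1)]
    return sum(spacings) / k


if __name__ == "__main__":
    sample = [1.0, 2.0, 4.0, 8.0]
    print(hill_estimator(sample, 2), hill_from_spacings(sample, 2))
```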

Recently, in Beirlant et al. (1999), Feuerverger and Hall (1999) and Beirlant
et al. (2002), it was proven that under some suitable conditions on the slowly
varying function $l_F$ and for $k$ relatively small with respect to $n$ (i.e., $k/n = o(1)$
as $k, n \to \infty$), the scaled log-spacings

\[ Y_j = j \left( \log X_{n-j+1,n} - \log X_{n-j,n} \right), \quad 1 \le j \le k < n, \tag{2.7} \]

approximately follow an exponential regression model

\[ Y_j \sim_d \left( \gamma + b_{n,k} \left( \frac{j}{k+1} \right)^{-\rho} \right) g_j, \quad 1 \le j \le k, \tag{2.8} \]

with the $g_j$ denoting i.i.d. standard exponential random variables, $b_{n,k} \to 0$ as
$k, n \to \infty$, and $\rho < 0$. From this model $\gamma$ can be estimated jointly with $b_{n,k}$ and
$\rho$ using the maximum likelihood method. In comparison to the Hill estimator, the
maximum likelihood estimator for $\gamma$ based on (2.8) is typically more stable over $k$
and performs better with respect to bias.

Figure 3(a) shows the Hill (solid line) and maximum likelihood (broken line)
estimates for the tail index of the conditional distribution of Ca as a function of
$k$. On this plot, we clearly see the influence of the outlying Ca measurements on
the $\gamma$-estimates. Considering decreasing $k$, when $k$ becomes smaller than 100
the estimates first increase drastically and then suddenly decrease for the smallest
$k$-values. Figure 3(b) shows estimates of the asymptotic mean squared error of the
Hill estimator

\[ AMSE(\hat\gamma_{k,H}) = \frac{\gamma^2}{k} + \left( \frac{b_{n,k}}{1 - \rho} \right)^2 \tag{2.9} \]

which, as in Beirlant et al. (1999), can be used as a criterion to find an optimal
sample fraction $k_{opt}$ for the Hill estimator. Here, $k_{opt}$ is found to be 249 (vertical
line), leading to an estimate $\hat\gamma_{k_{opt},H} = 0.298$.

Figure 3. (a) Hill (solid line) and maximum likelihood (broken
line) estimates of the conditional Ca measurements and (b) AMSE
estimates for the Hill estimator as a function of $k$.

3. Robust Estimation of the Tail Index 

To obtain a robust estimate of $\gamma$, we start by transforming the exponential
regression model (2.8) to its linearized form, given by

\[ Y_j \approx \gamma + b_{n,k} \left( \frac{j}{k+1} \right)^{-\rho} + \gamma e_j, \quad 1 \le j \le k < n, \tag{3.1} \]

where $e_j = g_j - 1$. Further we specify a canonical value for $\rho$. In Matthys and
Beirlant (2000) it is shown that for most applications a $\rho$ value between $-2$ and
$0$ is most appropriate. Moreover, estimates of $\gamma$ and $b_{n,k}$ are not highly influenced
by a specific choice of $\rho$. Hence, they recommend using $\rho = -1$ or $\rho = -0.5$.
For a fixed value of $k$, expression (3.1) becomes a linear regression model with
asymmetric errors of the form

\[ y_j = \theta_1 + \theta_2 t_j + \sigma e_j, \quad j = 1, \ldots, k, \tag{3.2} \]

with $t_j = \left( \frac{j}{k+1} \right)^{-\rho}$, $e_j = g_j - 1$, $\theta_1 = \gamma$, $\theta_2 = b_{n,k}$ and $\sigma = \gamma$.
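The regression data for the linearized model can be assembled directly from the ordered sample: the responses are the scaled log-spacings and the covariate is $t_j = (j/(k+1))^{-\rho}$. The Python sketch below is an illustration only; the function name `log_spacing_regression_data` is hypothetical, and the default $\rho = -1$ is simply one of the canonical values mentioned above.

```python
import math


def log_spacing_regression_data(data, k, rho=-1.0):
    """Return covariates t_j and scaled log-spacings y_j for j = 1, ..., k."""
    x = sorted(data)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    if x[n - k - 1] <= 0:
        raise ValueError("the k + 1 largest observations must be positive")
    # Covariate t_j = (j / (k + 1)) ** (-rho); with rho = -1 this is j / (k + 1).
    t = [(j / (k + 1)) ** (-rho) for j in range(1, k + 1)]
    # Response y_j = j * (log X_{n-j+1,n} - log X_{n-j,n}).
    y = [j * (math.log(x[n - j]) - math.log(x[n - j - 1]))
         for j in range(1, k + 1)]
    return t, y


if __name__ == "__main__":
    print(log_spacing_regression_data([8.0, 1.0, 4.0, 2.0], 2))
```

Any regression estimator for asymmetric errors can then be applied to the pairs $(t_j, y_j)$; the intercept plays the role of $\gamma$.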
In Marazzi and Yohai (2004) high efficiency and high breakdown point
estimators were proposed for this type of regression model. In general, consider $k$
observations $(t_j, y_j)$ with $j = 1, \ldots, k$ and $t_j = (1, t_{2j}, \ldots, t_{pj})'$ that satisfy the
linear relationship

\[ y_j = \theta_1 + \theta_2 t_{2j} + \ldots + \theta_p t_{pj} + \sigma e_j, \quad j = 1, \ldots, k, \tag{3.3} \]

where $\theta = (\theta_1, \ldots, \theta_p)' \in \mathbb{R}^p$ is a vector of regression parameters and $\sigma$ is a scale
parameter. Here, the error terms $e_j$ are assumed to be i.i.d. as a random variable $e$
with cdf $F_{0,1}$, which is the standard element of a parametric location-scale family
of asymmetric distributions with density function $f_{\lambda,\sigma}$ and cdf $F_{\lambda,\sigma}(z) = F_{0,1}((z - \lambda)/\sigma)$.

As a robust initial estimator for the general model (3.3) with asymmetric
errors, a corrected S-estimator for regression is suggested. The biweight S-estimate
$(\theta^*, \sigma^*)$ of $(\theta, \sigma)$ with 50% breakdown value is defined by $\theta^* = \arg\min_\theta S_{m_0}(\theta)$ and
$\sigma^* = S_{m_0}(\theta^*)$. For a given $\theta$, $S_{m_0}(\theta)$ is an M-estimator of scale of the residuals,
defined as the solution of

\[ \frac{1}{k - p} \sum_{j=1}^{k} \chi_{m_0}\left( (y_j - t_j'\theta)/S \right) = 0.5 \tag{3.4} \]

with respect to $S$ (Rousseeuw and Yohai, 1984). The function $\chi_m$ is defined by

\[ \chi_m(z) = \begin{cases} 3(z/m)^2 - 3(z/m)^4 + (z/m)^6 & \text{if } |z| \le m, \\ 1 & \text{if } |z| > m, \end{cases} \tag{3.5} \]

and the constants $a_0$ and $m_0$ satisfy $\int \psi_{m_0}(z - a_0) f_{0,1}(z)\,dz = 0$ and $\int \chi_{m_0}(z - a_0) f_{0,1}(z)\,dz = 0.5$, where
$\psi_m = \chi_m'$. A consistently corrected S-estimate for $\theta$ and $\sigma$ is then given by
$\tilde\theta_1 = \theta_1^* - \sigma^* a_0$, $\tilde\theta_j = \theta_j^*$ for $j = 2, \ldots, p$ and $\tilde\sigma = \sigma^*$.
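To make the scale step concrete, the following Python sketch evaluates the biweight function $\chi_m$ and solves the M-scale equation for a fixed set of residuals by bisection. It assumes the equation has the form $\frac{1}{k-p}\sum_j \chi_m(r_j/S) = 0.5$; the function names, the bisection scheme and the iteration count are choices made for this example, and the minimization over $\theta$ that defines the S-estimate itself is not attempted here.

```python
def chi_biweight(z, m):
    """Biweight chi function: 3u^2 - 3u^4 + u^6 for u = |z|/m <= 1, else 1."""
    u = abs(z) / m
    if u >= 1.0:
        return 1.0
    return 3 * u ** 2 - 3 * u ** 4 + u ** 6


def m_scale(residuals, m, n_params, target=0.5, iters=200):
    """Solve sum(chi(r_j / S)) / (k - n_params) = target for S by bisection."""
    k = len(residuals)

    def mean_chi(s):
        return sum(chi_biweight(r / s, m) for r in residuals) / (k - n_params)

    biggest = max(abs(r) for r in residuals)
    if biggest == 0.0:
        raise ValueError("all residuals are zero")
    low = 1e-12 * biggest
    if mean_chi(low) < target:
        raise ValueError("too many zero residuals for a solution to exist")
    high = biggest
    while mean_chi(high) > target:   # mean_chi is nonincreasing in s
        high *= 2.0
    for _ in range(iters):
        mid = (low + high) / 2.0
        if mean_chi(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    res = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1, -0.4, 0.9, -1.6, 0.2]
    print(m_scale(res, 0.9466, 2))
```

Bisection is adequate here because $\chi_m$ is nondecreasing in $|z|$, so the left-hand side of the scale equation is monotone in $S$.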

Next, a reweighting step is applied, as proposed in Rousseeuw and Leroy
(1987). Observations whose standardized residual $r_j = (y_j - t_j'\tilde\theta)/\tilde\sigma$ has a large
negative log-likelihood $\rho_j = \rho(r_j) = -\log(f_{0,1}(r_j))$ have a small likelihood under
the model $F_{0,1}$ and can be flagged as outliers. They are given a weight $w_j = I(\rho_j \le \eta)$,
where $I$ denotes the indicator function, and $\eta$ is a large quantile of the cdf of $\rho(e)$.
Finally, the maximum likelihood estimates of $(\theta, \sigma)$ in model (3.3) are computed
on the data points with $w_j = 1$.

When we apply this robust procedure to (3.2) we obtain preliminary estimates
$(\tilde\theta_1, \tilde\theta_2)'$, $\tilde\sigma$ and standardized residuals

\[ r_j = \frac{y_j - \tilde\theta_1 - \tilde\theta_2 t_j}{\tilde\sigma}. \]

Note that the constants $a_0$ and $m_0$ that are used in the definition of the
S-estimators are found to be $a_0 = -0.5700$ and $m_0 = 0.9466$. From the model cdf
$F_{0,1}(z) = 1 - \exp(-(z + 1))$ for $z > -1$, it follows that $\rho(e)$ is standard exponentially
distributed, hence we set $\eta = -\log(1 - 0.95)$, the 0.95-quantile of the standard
exponential distribution. Of course, other quantiles could be considered as well.
We now indicate the zero weight log-spacings $y_j$ as outliers and remove the
corresponding $x_{i,n}$ from our data set. More precisely, let $y_J$ be the outlying log-spacing
with largest index, then all $x_{i,n}$ with $n - J + 1 \le i \le n$ are flagged as outliers under
the Pareto model.
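The flagging rule can be sketched in Python as follows, assuming the shifted exponential error model with density $\exp(-(z+1))$ for $z > -1$, so that the negative log-likelihood of a standardized residual $r$ is $r + 1$. The function name `flag_outlying_spacings` is hypothetical, and residuals at or below $-1$ are treated as having zero density and are therefore flagged.

```python
import math


def flag_outlying_spacings(std_residuals, level=0.95):
    """Return 0/1 weights and the largest 1-based index J with zero weight."""
    eta = -math.log(1.0 - level)  # quantile of the standard exponential
    weights = []
    for r in std_residuals:
        # Negative log-likelihood under the density exp(-(z + 1)), z > -1.
        rho = r + 1.0 if r > -1.0 else math.inf
        weights.append(1 if rho <= eta else 0)
    flagged = [j for j, w in enumerate(weights, start=1) if w == 0]
    largest = max(flagged) if flagged else 0
    return weights, largest


if __name__ == "__main__":
    print(flag_outlying_spacings([5.0, 0.1, -0.5, 2.5, 0.0]))
```

With $J$ the returned index, the $J$ largest order statistics would then be set aside before the tail index is re-estimated.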




Figure 4. Hill (solid line) and maximum likelihood (broken line)
estimates of the conditional Ca measurements after rejection of
the outliers found for each $k$, together with the Hill (solid grey
line) and maximum likelihood (broken grey line) estimates before
rejection.



4. Application 

The results after applying our proposal to the Condroz data are depicted in Figure
4. Note that we have set $\rho = -1$ in (3.1). We have also tried some other values
of $\rho$ between $-2$ and $0$, but this did not change the result. The broken black line
shows the maximum likelihood estimates after removal of the outliers that were
found for each $k$, whereas the solid black line shows the Hill estimator on the
same reduced data sets. On this plot we have also superimposed (in grey) the ML
and Hill curves of Figure 3 that are based on the full data set. We see that both
robust curves yield much lower estimates for the tail index. Moreover, they are
rather constant for intermediate values of $k$, which gives support to the Pareto
model assumption. For very small values of $k$, however, the estimates are still not
stable. This is due to the small number of observations in the regression model
(3.2), where even an estimator with 50% breakdown is biased in the presence of
outliers.

It is observed that our method found six outliers for most $k$ values, except for
the smaller ones. The dark vertical line in Figure 4 is set at $k_{opt} = 85$. To obtain
this optimal sample fraction for the Hill estimator, we computed its AMSE as in
(2.9). Since not all the ML estimates were based on the same data set for all $k$, here
the ML estimators were computed on the data set after the six largest observations
were removed. The AMSE curve, shown in Figure 5, attains its minimum at $k_{opt} =
85$, from which $\hat\gamma_{k_{opt},H} = 0.177$ follows.

Figure 5. AMSE estimates for the Hill estimator after rejection
of the six largest observations.

Figure 6 shows the $\rho_j = \rho(r_j)$ for the regression data $(t_j, y_j)$ from model
(3.2) for $k = k_{opt} = 85$, together with the cut-off line that separates the outliers
from the regular observations. We see that $y_6$ is the observation with largest index
that exceeds the cut-off, hence six observations are flagged as being unlikely
under the Pareto model. This confirms our findings from the Pareto quantile plot
in Figure 2(b).

Figure 6. The $\rho(r_j)$ for the regression data $(t_j, y_j)$ from model
(3.2) with $k = 85$ together with the expected value and cut-off
line that separates the outliers from the regular observations.
5. Conclusions and Outlook

In this paper we have introduced a new robust estimator of the tail index of Pareto 
type distributions. It is obtained by applying a robust regression estimator for a 
linear model with asymmetric errors to scaled log-spacings. When we applied this 
method to Ca and pH measurements from the Condroz region in Belgium, we 
could easily identify the outliers that were also seen on a Pareto quantile plot, and 
we obtained a much lower estimate of the tail index. 

In our further research we will study this robust estimator in more detail. We 
will investigate its breakdown value, its influence function and its performance at 
simulated data sets. 

As a different approach, we will also study the application of a robust regres- 
sion estimator to the Pareto quantile plot data (2.3). In particular we will consider 
the deepest regression method (Rousseeuw and Hubert, 1999) as it yields consis- 
tent estimates under heteroscedastic error distributions. Nonconstant variances are 
likely to occur in quantile plots. 
Robust Processing of Mechanical
Vibration Measurements

S. Vanlanduit and P. Guillaume 

Abstract. When performing mechanical vibration measurements, it is often 
assumed that the disturbing measurement noise is Gaussian. In reality,
this is not always the case, and therefore classical least-squares procedures can 
give poor results when processing the measurements. In this paper, several 
robust statistical procedures will be used in four selected problems in vibration 
measurement and analysis. It will be shown that the procedures improve the 
results compared to classical least-squares methods. Because fast processing 
is required, a tradeoff between robustness of the methods and computation 
speed is made. In particular the following steps in vibration engineering were 
robustified: (1) positional calibration, (2) measurement post-processing, (3) 
system identification and (4) data classification for damage detection. 

Mathematics Subject Classification (2000). Primary 74H45; Secondary 93E12. 
Keywords. Robust, vibration analysis, signal processing. 



1. Motivation 

An important research domain in the field of mechanical engineering, is the analysis 
of vibrating systems. This so-called ‘modal analysis’ is an important tool for car 
manufacturers who want to determine the sound field inside a vehicle. Also, modal 
analysis offers the capability of detecting damage in structures (like for instance 
bridges or airplane components). 

A modal analysis typically requires three steps. Firstly, the structure is ex- 
cited and measurement data is collected from vibration sensors. Secondly, - after 
pre-processing - this data is processed using parameter estimation methods in or- 
der to obtain a parametric model of the structure under test. In a third step, this 
model can be used to draw conclusions on the structural behavior (acoustic ra- 
diation, presence of damage,. . . ). Alternatively, this post-processing step can also 
be performed using nonparametric data (as for instance the so-called Frequency 
Response Functions or FRFs). 
In general, the measurement noise on the vibration data is assumed to be
normally distributed. However, in reality this is not always the case. Indeed, some 
of the sensors can fail, giving rise to outliers in the measurements. In addition, 
even when the sensor works correctly, disturbing noise with a large amplitude 
(e.g., from neighboring machines) can seriously hamper the successful processing 
of data using classical (least-squares) techniques. 

In this article, an overview of potential applications of robust processing 
techniques in vibration engineering is given based on several case studies. For 
each of the considered case studies, a classical least-squares and a proposed robust 
solution is given. The case studies represent different steps typically executed in 
vibration engineering: 

• Calibration of measurement equipment: a robust technique for the calibration 
of an optical scanning vibrometer (Section 2). 

• Robust pre-processing of measured vibration spectra (Section 3): elimination 
of measurement dropouts. 

• Robust model identification (Section 4). 

• Data classification for structural damage detection using Principal Compo- 
nent Analysis (Section 5). 

2. Laser Position Calibration 

A measurement instrument that is widely used when the vibration response at a
large number of locations is required is the scanning laser Doppler vibrometer
(SLDV) (Halliwell, 1979). This device discretely scans spatial locations with a
laser beam, which is moved horizontally and vertically with the aid of two
rotating mirrors. Before measurements can be made, a calibration between the
scanning mirror angles $(\theta_i, \phi_i)$ and the pixel coordinates
$(p_i, q_i)$ (corresponding to the location of the laser beam on the object under
test) has to be performed. When the laser beam is oriented at randomly selected
angles $(\theta_i, \phi_i)$, the relation between the mirror angles and the
coordinates is given by:

$$p_i = p_0 + k_1 d \tan\theta_i \qquad (2.1)$$

$$q_i = q_0 + k_2 d \, \frac{\tan\phi_i}{\cos\theta_i} \qquad (2.2)$$


where $p_0$, $q_0$, $k_1$, $k_2$ and $d$ depend on the position of the SLDV
instrument (with respect to the position of the object) and the camera zoom. When
considering $N_p$ random angles $(\theta_i, \phi_i)$, Equations (2.1) and (2.2)
can be written in matrix form:

$$Ac = b \qquad (2.3)$$

where $A$ is a $2N_p \times 4$ matrix, $b$ is a vector of length $2N_p$ containing
the pixel coordinates $p_1, \ldots, p_{N_p}, q_1, \ldots, q_{N_p}$, and
$c = [p_0, q_0, k_1 d, k_2 d]^T$. The calibration is performed by solving the
system of equations (2.3). Unfortunately, some of the coordinates $p_i, q_i$
(i.e. the elements of $b$) have errors orders of magnitude larger than the
average error. This means that when Equation (2.3) is solved using least-squares
regression, an incorrect calibration is obtained. This can be clearly seen in
the LS residual errors $Ac - b$ in Figure 1a. Using a robust regression algorithm
such as least trimmed squares (LTS) (Rousseeuw and Van Driessen, 2000), a much
better result can be obtained (see Figure 1a). Once the parameters $c$ are
estimated, the angles $(\theta, \phi)$ needed to place the laser beam at a
certain validation location $(p, q)$ can be computed. In the validation
experiment shown in Figures 1b and 1c the laser beam is targeted onto position
(3,1) of a grid on a sheet of paper. With the LS calibration it is not possible
to hit the correct location (3,1), while the robust LTS solution aims exactly at
the target location. More details on the problem can be found in Vanlanduit et
al. (2003).

Figure 1. (a): Residual errors $Ac - b$ for the least-squares and robust
least-squares solutions of the laser position calibration problem. (b-c):
Validation experiment for the least-squares (b) and least trimmed squares (c)
solutions: target position at location (3,1).
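As an illustration of why trimming helps in this calibration, the following
Python sketch fits the first relation (2.1), $p_i = p_0 + k_1 d \tan\theta_i$, by
ordinary least squares and by a simple least trimmed squares search (random
two-point starts followed by concentration steps). It is a minimal sketch under
stated assumptions, not the FAST-LTS algorithm of the cited reference: the
function names `fit_line` and `lts_line`, the coverage `h`, the synthetic angles
and the parameter values 100 and 250 are all hypothetical choices made for the
example.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def lts_line(xs, ys, h, n_starts=50, n_csteps=20, seed=0):
    """Approximate LTS fit: minimise the sum of the h smallest squared residuals."""
    rng = random.Random(seed)
    n = len(xs)
    best = None
    for _ in range(n_starts):
        subset = rng.sample(range(n), 2)          # random two-point start
        for _ in range(n_csteps):                 # concentration steps
            a, b = fit_line([xs[i] for i in subset], [ys[i] for i in subset])
            ranked = sorted(range(n), key=lambda i: (ys[i] - a - b * xs[i]) ** 2)
            subset = ranked[:h]                   # keep the h best-fitting points
        a, b = fit_line([xs[i] for i in subset], [ys[i] for i in subset])
        cost = sum((ys[i] - a - b * xs[i]) ** 2 for i in subset)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]

# Synthetic calibration data (hypothetical values): p_i = p0 + (k1*d) * tan(theta_i)
p0_true, k1d_true = 100.0, 250.0
theta = [-0.3 + 0.6 * i / 39 for i in range(40)]
t = [math.tan(a) for a in theta]
p = [p0_true + k1d_true * ti for ti in t]
for i in range(6):                                # six gross pixel errors
    p[i] += 500.0

p0_ls, k1d_ls = fit_line(t, p)                    # pulled away by the outliers
p0_lts, k1d_lts = lts_line(t, p, h=30)            # fitted to the 30 best points
```

In this toy setting the trimmed fit recovers the generating parameters, while the
least-squares intercept is shifted by the six corrupted coordinates. The second
relation (2.2) could be treated in the same way with
$\tan\phi_i / \cos\theta_i$ as regressor.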
3. Pre-processing of Measurement Spectra

When using optical vibration measurement sensors on dark structures, the reflected
laser light can suddenly drop out, resulting in outliers (see Figure 2a). Instead of
using the time-domain measurements $x(t)$, sampled at time instances
$\delta t, \ldots, N_t \delta t$, in vibration analysis, the data are usually
transformed into the Fourier domain using a Discrete Fourier Transform (DFT):

$$X(\omega) = \mathrm{DFT}(x(t)) \qquad (3.1)$$

The result is that the error from outliers is smeared out over the full frequency
band in the spectra $X(\omega)$ (as can be seen clearly in Figure 2b). The DFT in
Equation (3.1) can equivalently be expressed as a matrix transformation (although
in practice a faster algorithm, the Fast Fourier Transform or FFT, is used):

$$X = Wx \qquad (3.2)$$

with $X$ and $x$ vectors of length $N_f$ and $N_t$ respectively (with $N_t$ and
$N_f$ the number of measured time samples and excited frequencies respectively)
and $W$ a known $N_f \times N_t$ matrix (typically $N_t = 1024$ and $N_f = 512$).
In principle, existing robust regression methods could be used to solve Equation
(3.2) in order to be resistant to outliers. However, for high spatial resolution
measurements, thousands of spectral measurements have to be performed. Therefore,
the processing time should be limited to less than a second, which excludes all
robust regression methods known to the authors.

Figure 2. Velocity measurement without and with outliers. Left: time domain,
right: frequency domain.
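The smearing effect can be checked directly from the matrix form (3.2). The
sketch below builds a small DFT matrix $W$ in plain Python and applies it to a
signal that is zero except for a single dropout-like spike; every frequency line
then carries the same error magnitude. The size `n = 8`, the spike amplitude and
the helper names are assumptions chosen for the example, and a square $W$ is used
instead of the $N_f \times N_t$ matrix of the text.

```python
import cmath

def dft_matrix(n):
    """Return the n-by-n DFT matrix W with W[k][t] = exp(-2*pi*j*k*t/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)]
            for k in range(n)]

def apply_matrix(w, x):
    """Compute X = W x for a list-of-lists matrix and a list vector."""
    return [sum(wkt * xt for wkt, xt in zip(row, x)) for row in w]

n = 8
x = [0.0] * n
x[3] = 5.0                           # a single outlier in the time domain
X = apply_matrix(dft_matrix(n), x)
magnitudes = [abs(v) for v in X]     # the outlier appears at every frequency
```

Because one corrupted time sample contaminates all spectral lines, it is more
natural to remove the outlier in the time domain, before the transformation.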

Instead, the following quasi-robust but fast procedure is proposed:

Algorithm: Iterative two step LS

1. Apply a median filter to $[x(\delta t), \ldots, x(N_t \delta t)]$ to obtain
$[\hat{x}(\delta t), \ldots, \hat{x}(N_t \delta t)]$.

2. Compute the Median Absolute Deviation (MAD) of the residual $e = x - \hat{x}$:

$$\mathrm{MAD}(e) = \mathrm{median}(|e - \mathrm{median}(e)|) \qquad (3.3)$$

When divided by 0.6745 the MAD is a robust estimator of the scale which is
consistent with the normal distribution (Huber, 1981).

3. Compute the set $I$ of time samples which are outliers:
$I = \{ i : |e(i \delta t)| > 5 \times \mathrm{MAD} \}$. Put
$\tilde{x} = x(\bar{I})$, the samples outside $I$. Remark that for a normal
distribution with outliers it would be more natural to take e.g. 1.96*MAD
(to obtain the $\alpha = 5\%$ confidence level). From experimental investigations
it was observed that the tail of the measurement noise distribution (without
any outliers included) is much heavier than for a normal distribution, and
therefore a higher value was needed. The value of 5*MAD was chosen because
it worked well on all investigated experimental data.

4. Compute a cubic spline interpolant $x_s$ of $\tilde{x}$ in the time samples
$\delta t, \ldots, N_t \delta t$ (Dierckx, 1993).

5. Put $x(I) = x_s(I)$.

6. Compute the low-pass filtered version $x_f$ of $x$.

7. Put $x(I) = x_f(I)$.

8. Repeat Steps 6 and 7 $N_{iter}$ times.

9. The processed data are given by the resulting $x$.
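The steps above can be sketched in a few lines of plain Python. This is a
minimal sketch under simplifying assumptions, not the authors' implementation:
linear interpolation stands in for the cubic spline of Step 4, a short moving
average stands in for the low-pass filter of Step 6, and the window lengths,
the function names and the synthetic test signal are all hypothetical.

```python
import math
import random
from statistics import median

def median_filter(x, half=2):
    """Running median with window 2*half+1 (window shrinks at the edges)."""
    n = len(x)
    return [median(x[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def moving_average(x, half=2):
    """Crude low-pass filter: running mean with window 2*half+1."""
    n = len(x)
    out = []
    for i in range(n):
        w = x[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(w) / len(w))
    return out

def interpolate(x, bad):
    """Replace flagged samples by linear interpolation between good neighbours."""
    good = [i for i in range(len(x)) if i not in bad]
    y = list(x)
    for i in sorted(bad):
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            y[i] = x[right]
        elif right is None:
            y[i] = x[left]
        else:
            w = (i - left) / (right - left)
            y[i] = (1 - w) * x[left] + w * x[right]
    return y

def repair_dropouts(x, k=5.0, n_iter=100):
    xm = median_filter(x)                                   # Step 1
    e = [a - b for a, b in zip(x, xm)]
    med = median(e)
    mad = median(abs(v - med) for v in e)                   # Step 2, Eq. (3.3)
    bad = {i for i, v in enumerate(e) if abs(v) > k * mad}  # Step 3
    y = interpolate(x, bad)                                 # Steps 4-5
    for _ in range(n_iter):                                 # Steps 6-8
        lp = moving_average(y)
        for i in bad:
            y[i] = lp[i]
    return y, bad                                           # Step 9

# Hypothetical test signal: slow sine, Gaussian noise, three dropouts
rng = random.Random(1)
clean = [math.sin(2 * math.pi * i / 256) for i in range(512)]
x = [c + rng.gauss(0, 0.05) for c in clean]
for i in (50, 120, 200):
    x[i] += 5.0
y, bad = repair_dropouts(x)
```

Only the samples in the outlier set are ever modified, so the remaining
measurements pass through unchanged; this is what keeps the procedure cheap
compared with a full robust regression on (3.2).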

In Vanlanduit et al. (2002) it is demonstrated that the Iterative two step LS
solution converges to the global solution. Figure 3 shows the results of the
quasi-robust processing of vibration measurements with a dropout. The results
improve drastically with the introduced method (the two vibration modes around
470 Hz and 540 Hz are much clearer in Figure 3d than in Figure 3c). Less than a
tenth of a second of processing time is needed for $N_{iter} = 100$ iterations on
a Pentium 4 1.4 GHz processor.

Figure 3. (a) Raw time data $x$ and reference data without outlier, (b)
quasi-robust pre-processed time data (see the Iterative two step LS algorithm),
(c) raw frequency domain data $X$ and reference data without outlier, (d)
quasi-robust pre-processed frequency domain data, i.e. the FFT of the
pre-processed time data. Two vibration modes are visible as the peaks around
470 Hz and 540 Hz.

In the next section the implementation of a robust modelling technique for
vibration measurements will be illustrated.



4. Modal Parameter Estimation

In vibration engineering, the mechanical structure is excited with a force $x(t)$
and the resulting response $y(t)$ is measured. From these measurements the spectra
$X(\omega)$ and $Y(\omega)$ are computed (using a DFT), which in turn lead to the
so-called Frequency Response Function (FRF) $H$: $H = Y/X$. Each vibrating system
can be modelled as a fraction of polynomials in the frequency variable, and the
measured FRF $H$ is used to identify the polynomials' coefficients, which
represent the system parameters. This can be done by minimizing the error $E$:

$$E(\omega) = H(\omega) - \frac{B(\omega)}{A(\omega)} \qquad (4.1)$$

where $A$ and $B$ are polynomials in the frequency variable. A first possibility
to estimate the model parameters $\theta$ is to minimize a quadratic cost
function $\ell_2$:

$$\ell_2(\theta) = \sum_{k=1}^{N_f} \left| H(\omega_k) - \frac{B(\omega_k, \theta)}{A(\omega_k, \theta)} \right|^2 \qquad (4.2)$$

This cost function gives high weights to outlying values in the error. A much more
robust alternative is the use of a logarithmic cost function $\ell_{\log}$
(Guillaume et al.):

$$\ell_{\log}(\theta) = \sum_{k=1}^{N_f} \left| \log \frac{B(\omega_k, \theta)}{H(\omega_k) A(\omega_k, \theta)} \right|^2 \qquad (4.3)$$

Both cost functions are nonlinear in the parameters and therefore their
minimization is quite involved (an iterative scheme like the Levenberg-Marquardt
optimization algorithm is often used to perform this task). The logarithmic cost
function is much more robust than the least-squares solution (indeed, the
logarithmic transformation gives a lower weight to outliers in the error $E$).
Remark that the solution is not truly robust, since the influence of outliers is
not bounded. Moreover, it was shown in Guillaume et al. (1995) that the
logarithmic cost solution leads to nearly efficient estimates.

In Figure 4, an FRF with outliers due to disturbing 50 Hz and harmonic
components is shown. It can be seen that the logarithmic cost function (Figure
4b) is able to identify the vibration mode at 120 Hz (although some bias is
present), while the quadratic cost function (Figure 4a) gives completely
erroneous results.

Figure 4. (a) Estimated solution using the quadratic cost function,
(b) estimated solution using the logarithmic cost function.
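The different weight that the two cost functions give to an outlier can be made
concrete with a small numerical sketch. It evaluates (4.2) and (4.3) at the true
parameters of a single-mode model for an FRF in which one spectral line is
corrupted by a factor 100. The model $1/(m s^2 + c s + k)$ with $s = j\omega$,
the parameter values, the frequency grid and the function names are assumptions
made for the illustration; the complex logarithm is taken on its principal
branch.

```python
import cmath

def frf_model(omega, theta):
    """Single-mode FRF B/A with B = 1 and A = m*s^2 + c*s + k, s = j*omega."""
    k, c, m = theta
    s = 1j * omega
    return 1.0 / (m * s * s + c * s + k)

def quadratic_cost(omegas, H, theta):
    """Cost (4.2): sum of squared moduli of the output error."""
    return sum(abs(h - frf_model(w, theta)) ** 2 for w, h in zip(omegas, H))

def log_cost(omegas, H, theta):
    """Cost (4.3): sum of squared moduli of log(model / measurement)."""
    return sum(abs(cmath.log(frf_model(w, theta) / h)) ** 2
               for w, h in zip(omegas, H))

theta_true = (1.0, 0.1, 1.0)                  # hypothetical k, c, m
omegas = [0.5 + 0.05 * i for i in range(21)]  # grid around the resonance
H = [frf_model(w, theta_true) for w in omegas]
H[10] *= 100.0                                # one outlying spectral line

quad = quadratic_cost(omegas, H, theta_true)
logc = log_cost(omegas, H, theta_true)
```

At the true parameters only the corrupted line contributes. Its contribution
to the quadratic cost grows with the square of the outlier amplitude, whereas
its contribution to the logarithmic cost is $(\ln 100)^2$, which grows only
logarithmically but, as noted above, is still unbounded.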



5. Damage Detection 

When a structure is damaged (cracks, delaminations, loose parts) its vibration
behavior changes. This can be observed through the FRFs $H(\omega)$ of the
structure: the vibration modes (peaks in the FRFs) are shifted towards lower
frequencies.

Consider a matrix $H = [|H_1|, \ldots, |H_n|]$, where each column of $H$ contains
an FRF of a structure measured in a certain condition. The goal is to identify
the outliers among the $n$ conditions (these outliers then represent the damaged
specimens). Because the FRFs of an intact structure form a subspace, the outlier
detection can be done using Principal Component Analysis (PCA). FRFs of a damaged
structure, however, do not belong to the subspace of the intact structure. By
plotting the distance of the $n$ different objects to the subspace it is possible
to detect damaged specimens. However, classical PCA does not recover the correct
subspace because of the presence of the damaged FRFs, which are outliers.
Therefore, a robust PCA should be used. In this paper the RAPCA algorithm
(Hubert et al., 2002) was used to perform this task. Figure 5a shows 27 FRFs,
measured at different time periods and in different conditions (21 intact and 6
damaged). From Figure 5b it can be seen that the 6 damaged beams can be identified
with the robust PCA, while the classical PCA fails to detect the cracked beams
correctly (1 intact and only 3 damaged beams are marked as outliers).




Figure 5. (a) Frequency Response Functions of a steel beam for
27 conditions (21 intact and 6 cracked), (b) Robust versus classical
PCA distance for the 27 FRFs in (a).
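The distance-to-subspace idea can be illustrated with a plain-Python sketch. It
is not the RAPCA algorithm: to keep the example short it fits a one-dimensional
subspace with classical PCA (power iteration) to observations that are assumed
to be known intact, and then measures the orthogonal distance of a new
observation to that subspace. All names and the toy three-point "spectra" are
hypothetical.

```python
import math

def first_component(rows, n_iter=100):
    """Mean and leading principal direction of the rows (power iteration)."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - mean[j] for j in range(m)] for r in rows]
    v = [1.0 / math.sqrt(m)] * m                 # arbitrary start direction
    for _ in range(n_iter):
        scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def orthogonal_distance(row, mean, v):
    """Distance from a row to the line through `mean` with direction `v`."""
    d = [a - b for a, b in zip(row, mean)]
    score = sum(a * b for a, b in zip(d, v))
    resid = [a - score * b for a, b in zip(d, v)]
    return math.sqrt(sum(r * r for r in resid))

# Toy data: intact observations lie exactly on the line spanned by (1, 2, 2)
intact = [[s * 1.0, s * 2.0, s * 2.0] for s in (-2, -1, 0, 1, 2)]
mean, v = first_component(intact)
d_intact = [orthogonal_distance(r, mean, v) for r in intact]
d_damaged = orthogonal_distance([3.0, 0.0, 0.0], mean, v)
```

When the damaged observations are mixed into the data used to fit the subspace,
as in the experiment of Figure 5, the classical fit is itself distorted, which
is the reason a robust PCA is needed there.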



6. Conclusions 

From the examples in the paper, it is clear that there is an important need for
robust procedures to process vibration measurement data. This is in particular the
case when optical vibration measurement instrumentation is used. On the other
hand, because of the large amount of data that has to be processed (typically
thousands of measurement records of a few thousand time samples each), a tradeoff
between robustness and computational efficiency has to be made. In this paper
four adapted robust processing techniques were proposed for application during
the different steps of the vibration engineering procedure: calibration,
pre-processing, model identification and damage detection. It was shown on
experimental data that the results were much better than those of the classical
least-squares solutions.



Quadratic Mixed Integer Programming Models
in Minimax Robust Regression Estimators

G. Zioutas 

Abstract. The robust estimation of regression parameters is formulated as a
mixed integer quadratic programming problem. The main contribution of this
technique is that it improves the estimator efficiency by down-weighting only
bad influential points, either y-outliers or x-outliers. We follow the minimax
strategy, where the objective function of our mathematical programming
formulation is mainly a Huber loss function, and bad influential outliers are
pulled towards the regression line at low cost. This penalized pulling cost is
a function of Mallows type weights, and on the modified data a GM estimator
(Schweppe type) can be defined. The main advantage of the proposed technique
is that data points are not down-weighted unless they have substantially
increased the squared residuals. Previously published mixed integer
formulations withdraw the most influential data points even if they are not
bad influential points. GM estimators are compared to our proposal via
simulated experiments; the robust estimator obtained by quadratic programming
performs reasonably.

Mathematics Subject Classification (2000). Primary 05C38, 15A15; Secondary
05A15, 15A18.

Keywords. Bounded influence, outliers, quadratic programming, mixed integer 
programming, robust regression. 



1. Introduction

It is well known that least squares estimators are very sensitive to departures from
the normality assumption on the error distribution (Hampel et al., 1986 and Huber,
1981). These deviations are usually referred to as outliers in statistics and affect
the estimators. Different kinds of outliers or influential points can occur for many
reasons, but in this paper we consider the estimation of regression parameters
mainly in the presence of x-space outliers (high leverage points) in the data, which
can be divided into “bad” leverage points and “good” leverage points.

- “Bad” leverage points are regression outliers, which do not follow the pattern
of the bulk of the data and are highly influential. Even a single “bad”
leverage point can render least squares regression coefficients
meaningless. Therefore, their influence should be limited.

- “Good” leverage points are high leverage points which follow the pattern
of the bulk of the data. Usually, “good” leverage points do not correspond
to large residuals. They carry valuable information about the regression line
and contribute to the precision of the parameter estimation. Therefore, such
points should not be down-weighted, otherwise significant information will be
lost (Krasker and Welsch, 1982).

A group of robust methods has been developed in which the estimators are less
sensitive to outliers. Maximum Likelihood type estimation (M-estimation) and
Generalized Maximum Likelihood type estimation (GM-estimation), with Mallows
weights or with big residuals cut according to Schweppe forms, are among the
popular robust techniques. In this paper we are concerned with bounded influence
estimates (GM), combining the Huber loss function, Mallows weights and the
Schweppe residual size.

We consider whether “good” leverage points can be weighted appropriately
(reducing the down-weighting) so as to increase estimator precision. Our basic
idea is to restrict the activation of down-weighting to the largest residuals.

In the presented approach, the influence of every outlier is bounded by
following the general robust procedure, where an observation point $(x_i, y_i)$
can be down-weighted by pulling its $y_i$-value towards its fitted value. The
pulling distance for each $y_i$ is penalized proportionally to given weights
$w(x_i)$, which have been proposed by Hampel for the Mallows-type estimator
(Hampel, 1978, 1986). The proposed estimator is defined by minimizing an
objective function similar to Huber’s “loss” function, which is the sum of the
squared winsorized residuals and the penalty cost of the overall pulling.

The robust procedure has been formulated by Camarinopoulos and Zioutas
(2002) as an optimum allocation problem, where the objective function (“loss”
function) is minimized subject to a given total distance of overall pulling. Thus,
the available pulling resource is optimally allocated to outlying points, with respect
to both the residual size and the pulling cost. Since the amount of overall pulling
is limited, it may not be sufficient to pull all the high leverage points, which
correspond to a low penalty cost. Therefore, the down-weighting of “good” leverage
points is reduced. A solution to the optimum allocation problem is obtained by
using mathematical programming, an optimization technique suitable for many
statistical procedures (Arthanari and Dodge, 1993).

Alternatively, in order to down-weight only those highly influential points which
correspond to big residuals, an improved mathematical programming formulation is
developed in this work. Specifically, a quadratic mixed integer programming
formulation is used whose objective function is a Huber type loss function. The
penalty cost of pulling high leverage points is very low (Mallows’s weights), but
the pulling is activated only for points which correspond to big residuals, as
suggested by Schweppe or other GM-estimators.

The proposed technique has also the flexibility to establish constraints for the 
regression parameters by adapting the mathematical programming formula. Also, 
a sensitivity analysis, by fluctuating the pulling resource, may potentially provide 
insight regarding the most catastrophic outlier detection. 

The new estimator, called QMIP, is defined by the minimization of a convex
function, leading to a unique solution. Also, the bad influence of outlying points
is bounded, consuming the least possible amount of overall pulling. These factors
contribute to good robust properties (Huber, 1981).

In the next section, we briefly review the robust algorithm for M and GM 
estimators. Section 3 describes the proposed QMIP estimator and the mathemat- 
ical programming formulation. The Monte Carlo design and results are presented 
in Sections 4 and 5. 



2. Robust Methodology 

Consider the linear regression model,

$$y = x^T \beta + e \qquad (2.1)$$

where $y$ is the response variable, $x$ is a $p \times 1$ vector of explanatory
variables, $\beta$ is a $p \times 1$ vector of unknown parameters, and $e$ is a
random error with expectation zero and variance $\sigma^2$. We observe a sample
$(x_1, y_1), \ldots, (x_n, y_n)$ and we wish to construct a robust estimator in
the sense that the influence of any observation $(x_i, y_i)$ on the sample
estimator is bounded. Of course, the ordinary least squares estimate of $\beta$
does not satisfy the robustness requirement.



2.1. “Loss” Function 



For a known scale parameter $\sigma$, an M-estimator of Huber type (Huber, 1973,
1981) can be defined by minimizing the sum of less rapidly increasing functions
of the residuals,

$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \rho_c(u_i/\sigma)\,\sigma^2 \qquad (2.2)$$

where $\rho_c$ is a less rapidly increasing function, known as the “loss”
function, of the residuals $u_i = y_i - x_i^T \beta$:

$$\rho_c(u_i/\sigma) = \begin{cases} \tfrac{1}{2}(u_i/\sigma)^2 & \text{for } |u_i/\sigma| \le c \\ c\,|u_i/\sigma| - c^2/2 & \text{for } |u_i/\sigma| > c. \end{cases} \qquad (2.3)$$

Differentiating the expression (2.2) with respect to the regression coefficients
$\beta_j$ yields

$$\sum_{i=1}^{n} \psi_c(u_i/\sigma)\, x_{ij} = 0 \quad \text{for } j = 1, \ldots, p \qquad (2.4)$$

where $\psi_c$ is the derivative of $\rho_c$, and from (2.3) we obtain

$$\psi_c(u_i/\sigma) = \begin{cases} u_i/\sigma & \text{for } |u_i/\sigma| \le c \\ c\,\mathrm{sign}(u_i/\sigma) & \text{for } |u_i/\sigma| > c. \end{cases} \qquad (2.5)$$

Even though $\psi_c(\cdot)$ limits the effect of large residuals, the influence
of an observation $(y_i, x_i)$ can be arbitrarily large because $x_i$ multiplies
$\psi_c(\cdot)$. Thus, the M-estimators do not bound the influence of outliers in
the covariate space (regression outliers).
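The loss function (2.3) and its derivative (2.5) translate directly into code.
The following sketch works on the standardized residual $r = u_i/\sigma$; the
function names and the default tuning constant `c = 1.345` (a commonly used
value) are assumptions of this example and are not taken from the text.

```python
def huber_rho(r, c=1.345):
    """Huber loss (2.3) of a standardized residual r = u / sigma."""
    a = abs(r)
    return 0.5 * r * r if a <= c else c * a - 0.5 * c * c

def huber_psi(r, c=1.345):
    """Derivative (2.5) of the Huber loss: identity inside [-c, c], clipped outside."""
    if abs(r) <= c:
        return r
    return c if r > 0 else -c

# Quadratic in the centre, linear in the tails
small = huber_rho(0.5, c=1.0)    # 0.5 * 0.5**2
large = huber_rho(3.0, c=1.0)    # 1.0 * 3.0 - 0.5
```

Since `huber_psi` is bounded by `c`, a large residual contributes at most
$c\,x_{ij}$ to (2.4); the factor $x_{ij}$ itself is not bounded, which is the
weakness discussed above.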

In 1975 (Handschin et al., 1975), Schweppe modified (2.2), minimizing with
respect to $\beta$ a similar form of “loss” function, as follows:

$$\sum_{i=1}^{n} \sigma^2 v(x_i)^2 \rho_c\big(u_i/(\sigma v(x_i))\big) \qquad (2.6)$$

where $v(x_i) = \sqrt{1 - h_i}$ and $h_i$ are the diagonal elements of the matrix
$H = X(X^T X)^{-1} X^T$. If the $i$th observation is an outlier in x-space, this
is indicated by a large value of $h_i$, and the corresponding small value of
$v(x_i)$ contributes to more drastic down-weighting. Thus, the solution of (2.6)
can provide a bounded influence estimator, with the additional property that if
$u_i/(v(x_i)\sigma)$ is small, the effect of $v(x_i)$ is cancelled out.
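To see how the leverages $h_i$ and the factors $v(x_i) = \sqrt{1 - h_i}$ behave,
the sketch below computes them for a simple regression with intercept, where the
diagonal of $H$ has the closed form $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The
restriction to one explanatory variable, the function name and the data are
assumptions made to keep the example free of matrix code.

```python
import math

def leverages(xs):
    """Hat-matrix diagonal h_i for simple regression with an intercept."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 20.0]               # the last point is far out in x-space
h = leverages(xs)
v = [math.sqrt(1.0 - hi) for hi in h]         # Schweppe factors v(x_i)
```

The isolated point at $x = 20$ receives a leverage close to one and hence a small
$v(x_i)$, so its residual is cut at a much lower level than those of the other
points.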

Mallows has proposed a bounded-influence estimator by minimizing

$$\sum_{i=1}^{n} \sigma^2 w(x_i) \rho_c(u_i/\sigma) \qquad (2.7)$$

where $w(x_i)$ are certain weights that depend on the distance of $x_i$ from the
center of the multivariate regression space cloud. This estimator bounds the
influence of an outlying $x_i$, but one problem is immediately apparent:
down-weighting every high leverage point regardless of the underlying residual
cannot be efficient.

The Schweppe-type estimator has the potential to help overcome some of the
efficiency problems noted for the Mallows estimator. Hill (1977) suggested using
Schweppe forms with Mallows weights for more drastic down-weighting and, solving

$$\sum_{i=1}^{n} w(x_i)\, \psi_c\big(u_i/(w(x_i)\sigma)\big)\, x_{ij} = 0 \quad \text{for } j = 1, \ldots, p \qquad (2.8)$$

found that this last estimator has an advantage over the above estimators.
Therefore, the Schweppe objective function is typically preferred in most
GM-estimators as well as in most compound estimators (Simpson and Montgomery,
1998). But there is still a small problem when $(y_i, x_i)$ is a “good” leverage
point, corresponding to a small residual $u_i$ and a very small $w(x_i)$ such
that $u_i/(w(x_i)\sigma) > c$. By down-weighting such points, some information
may be lost.

In 1977 Welsch obtained a bounded influence estimator by solving the above
minimization problem with $v(x_i) = (1 - h_i)/h_i^{1/2}$. In order to bound the
influence of “gross” errors in the independent variables, Krasker and Welsch
(1982) proposed a robust estimator (KW) by limiting the sensitivity of the
estimators in an efficient way. Their choice of $v(x_i)$ is similar to the last
mentioned one.

It is remarkable that most of the above estimators can be defined by solving
the problem (2.2), and differ only in the choice of the tuning constant $c$
(Huber, 1983). Different values of the tuning parameter $c$, $c = c_i$, lead to
different bounded influence estimators (Krasker and Welsch, 1982). For example,
if we wish to define a Schweppe-type estimator we minimize Huber’s loss function
(2.2) or solve the equation system (2.4) using $c = c_i = \sqrt{1 - h_i}$.


2.2. General Robust Procedure 

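Multiplying both sides of (2.3) by $2\sigma^2$, and after simple calculus on the
right-hand side, we obtain the following equivalent “loss” function:

$$2\sigma^2 \rho_{c_i}(u_i/\sigma) = \begin{cases} u_i^2 & \text{for } |u_i| \le c_i\sigma \\ c_i^2\sigma^2 + 2c_i\sigma\varepsilon_i & \text{for } |u_i| > c_i\sigma \end{cases} \qquad (2.9)$$

where $\varepsilon_i$ indicates the amount by which a big residual $u_i$ is
shortened:

$$\varepsilon_i = \begin{cases} |u_i| - c_i\sigma & \text{for } |u_i| > c_i\sigma \\ 0 & \text{for } |u_i| \le c_i\sigma \end{cases} \qquad (2.10)$$

or equivalently, $\varepsilon_i$ can be interpreted as the distance by which
$y_i$ is pulled towards its fitted value; from now on we shall call this the
“pulling” distance.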

Combining equations (2.9) and (2.10), an alternative expression of the “loss”
function (2.2) is obtained and the robust estimator can be defined by solving an
equivalent problem,

$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( u_i^{*2} + 2c_i\sigma\varepsilon_i \right) \qquad (2.11)$$

where $u_i^*$ are the modified residuals, which we call metrically Winsorized
residuals,

$$u_i^* = \begin{cases} u_i & \text{for } |u_i| \le c_i\sigma \\ c_i\sigma = u_i - \varepsilon_i & \text{for } u_i > c_i\sigma \\ -c_i\sigma = u_i + \varepsilon_i & \text{for } u_i < -c_i\sigma \end{cases} \qquad (2.12)$$

and $2c_i\sigma\varepsilon_i$ can be considered as the penalty cost of pulling
$y_i$ towards its fitted value over a distance $\varepsilon_i$. Obviously, for
influential points with a large value of $h_i$ the corresponding cutting
parameter $c_i$ is small, leading to a low pulling cost. Therefore, when solving
(2.11), the $y_i$ of high leverage points may be pulled over longer distances,
yielding more drastic down-weighting. An important interpretation of the problem
(2.11) is that the estimator is defined by modifying the observations $y_i$ with
the smallest possible amount of overall pulling distance
$E = \sum_{i=1}^{n} \varepsilon_i$. This leads to a minimax robust estimator
(minimax theory, Huber, 1981).
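The equivalence between the Huber loss and the "Winsorized residual plus pulling
cost" form can be verified numerically. The sketch below computes the pulling
distance (2.10) and the Winsorized residual (2.12) for a few residuals and
checks that $u_i^{*2} + 2c_i\sigma\varepsilon_i$ equals
$2\sigma^2\rho_{c_i}(u_i/\sigma)$, as stated in (2.9). The function names and the
numerical values of $c_i$ and $\sigma$ are assumptions chosen for illustration.

```python
def huber_rho(r, c):
    """Huber loss (2.3) of a standardized residual r."""
    a = abs(r)
    return 0.5 * r * r if a <= c else c * a - 0.5 * c * c

def pull(u, c_i, sigma):
    """Return (Winsorized residual u*, pulling distance eps), Eqs. (2.12), (2.10)."""
    limit = c_i * sigma
    eps = max(abs(u) - limit, 0.0)
    if abs(u) <= limit:
        return u, eps
    return (limit if u > 0 else -limit), eps

c_i, sigma = 1.5, 2.0
checks = []
for u in (-3.0, -0.4, 0.0, 0.7, 5.0):
    u_star, eps = pull(u, c_i, sigma)
    lhs = 2 * sigma ** 2 * huber_rho(u / sigma, c_i)   # left side of (2.9)
    rhs = u_star ** 2 + 2 * c_i * sigma * eps          # summand of (2.11)
    checks.append((u, lhs, rhs))
```

For $u = 5$, for instance, the residual is cut back to $c_i\sigma = 3$ and the
pulling distance is 2, so both sides equal $9 + 12 = 21$.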
3. New Robust Approach

The proposed quadratic programming formulation (Camarinopoulos and Zioutas,
2002) was based on the idea of restricting the down-weighting resource or,
equivalently, constraining the total pulling distance. The estimator then
selectively down-weights “bad” leverage points by optimally allocating the
pulling resource. The leverage measure $h_i$ and the residual $u_i$ were
considered as optimality criteria. In the case of an unconstrained pulling
resource, the proposed estimator QP leads to a Schweppe-type estimator in which
Mallows weights are used as an argument in the $\psi$-function (Rousseeuw and
Leroy, 1987; Coakley and Hettmansperger, 1993; Simpson and Montgomery, 1998).

The proposed robust procedure has as its starting point:

- the weight proposals for the Mallows-type estimator, $w_i = w(x_i)$,

- the total pulling distance $E$, obtained empirically from the Mallows-type
estimator.



The robust estimator QP is defined by optimally allocating the total distance as
in the following problem:

$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( u_i^{*2} + 2cw_i\sigma\varepsilon_i \right)$$

subject to:

$$\sum_{i=1}^{n} \varepsilon_i \le E. \qquad (3.1)$$

In the above objective function the penalty cost, $2cw_i\sigma\varepsilon_i$, of
pulling $y_i$ towards its fitted value depends on the weight $w_i = w(x_i)$,
which is based on Hampel’s leverage measure. For an outlying observation in the
x-space the corresponding penalty cost is very small and thus $y_i$ can be pulled
easily towards the regression line, resulting in excessive down-weighting.
Therefore, the proposed estimator QP is a bounded influence estimator.

The above approach has the potential to restrict the total down-weighting
resource $E$. In an optimum solution of the problem (3.1), most of the
down-weighting resource is allocated preferably to catastrophic outliers, like
the “bad” leverage points, since the restricted resource $E$ may not be
sufficient for down-weighting every leverage point.

However, a difficulty arises in the formulation (3.1) with the optimum size of
the overall pulling, $E$. The down-weighting resource is a significant factor for
the robustness and effectiveness, and the optimum size of this given constant has
to be searched for.

In the approach proposed below, the pulling resource is not restricted; instead,
a new restriction is introduced into the model. A point is modified only if its
residual $u_i$ is greater than the upper limit specified in the Schweppe or
another GM-estimator. In this section, two alternative models for the new robust
approach are first described, and then the estimates are obtained by solving the
corresponding mathematical programming problem.



3.1. The Proposed QMIP Estimators 

In the new proposal, the QMIP1 estimator is defined by minimizing the same loss
function as in (3.1), under a new constraint concerning the rule that activates
pulling. This constraint allows the shortening of big residuals according to a
Schweppe form:

$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( u_i^{*2} + 2cw_i\sigma\varepsilon_i \right)$$

subject to:

$$|u_i^*| = \begin{cases} |u_i| - \varepsilon_i & \text{when } |u_i| > c\sqrt{1-h_i}\,\sigma \\ |u_i| & \text{when } |u_i| \le c\sqrt{1-h_i}\,\sigma. \end{cases} \qquad (3.2)$$

In the above objective function the penalty cost, $2cw_i\sigma\varepsilon_i$, of
pulling $y_i$ towards its fitted value depends on Mallows’s weight
$w_i = w(x_i)$, which is based on a leverage measure. For an outlying observation
in the x-space the corresponding penalty cost is very small and thus $y_i$ can be
pulled easily towards the regression line, resulting in excessive down-weighting.
Therefore, the proposed estimator is a bounded influence estimator.



It should be noted here that: 

- the pulling cost is a significant factor for the degree of robustness, 

- the constraint concerning the residual size is a significant factor for the ef- 
fectiveness of the proposed estimator, indicating which data points must be 
down-weighted. 
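The role of the two ingredients, pulling cost and residual constraint, can be
made explicit by evaluating the QMIP1 objective for a fixed candidate $\beta$.
The sketch below does only that; it does not solve the mixed integer program. It
assumes that a point whose residual exceeds the Schweppe limit
$c\sqrt{1-h_i}\,\sigma$ is pulled exactly back to that limit, and the function
name, arguments and numbers are hypothetical.

```python
import math

def qmip1_loss(residuals, leverages, weights, c, sigma):
    """Objective of (3.2) for given residuals u_i, leverages h_i and weights w_i."""
    total = 0.0
    for u, h, w in zip(residuals, leverages, weights):
        limit = c * math.sqrt(1.0 - h) * sigma     # Schweppe activation limit
        if abs(u) > limit:
            eps = abs(u) - limit                   # pulling distance
            total += limit ** 2 + 2 * c * w * sigma * eps
        else:
            total += u * u                         # point left untouched
    return total

# Two points with h_i = 0.36 and w_i = 0.5: one beyond the limit 0.8, one inside
loss = qmip1_loss([2.0, 0.5], [0.36, 0.36], [0.5, 0.5], c=1.0, sigma=1.0)
```

Here the first point contributes $0.8^2 + 2 \cdot 0.5 \cdot 1.2 = 1.84$ and the
second $0.5^2 = 0.25$: the weight $w_i$ sets how cheap the pulling is, while the
limit decides whether pulling is activated at all.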



An alternative formulation of the above approach leads to a new robust estimator,
QMIP2. By reducing the pulling cost to zero, large outliers (x-outliers,
y-outliers) can be discarded from the sample. The QMIP2 estimator can be defined
by solving the problem

$$\underset{\beta}{\text{minimize}} \sum_{i=1}^{n} \left( u_i^{*2} + 0 \cdot \varepsilon_i \right)$$

subject to:

$$u_i^* = \begin{cases} 3c\sqrt{1-h_i}\,\sigma & \text{when } |u_i| > 3c\sqrt{1-h_i}\,\sigma \\ u_i & \text{when } |u_i| \le 3c\sqrt{1-h_i}\,\sigma. \end{cases} \qquad (3.3)$$

The residual constraint determines which data points should be removed. In the
robust procedure (3.3), a data point is pulled freely towards the fitted value if
its residual is bigger than $3c\sigma\sqrt{1-h_i}$. Free pulling means that the
corresponding point does not influence the estimation; this is equivalent to
discarding it.

An interpretation of the residual constraint is the fixed penalty cost,
$(3c\sigma\sqrt{1-h_i})^2$, for discarding a data point $(x_i, y_i)$.
Equivalently, this discarding cost can be included in the loss function, as
shown in formula (3.5) in the next subsection. A significant advantage of this
discarding cost is the effectiveness of the resulting estimator, since only
catastrophic outliers are deleted from the sample. Also, the pulling cost for
down-weighting “smaller” outliers could be included in the same loss function.

For the proposed estimator QMIP2, defined by solving the problem (3.3), a higher
breakdown point than for the above mentioned GM-estimators is expected. This is
a consequence of the free pulling or, equivalently, of discarding the bad
outliers. In the proposed robust procedure, the loss function increases only by
a fixed cost for discarding a bad outlier; it does not take into account how far
away the outlier is. However, the breakdown property of the QMIP2 estimator
remains to be investigated.
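The fixed discarding cost can be illustrated by evaluating the QMIP2 objective
for given residuals. As with the previous sketch, this only evaluates the loss
implied by (3.3) for a fixed $\beta$ and is not a solver; the function name and
the numbers are assumptions.

```python
import math

def qmip2_loss(residuals, leverages, c, sigma):
    """Objective of (3.3): squared residual, capped at the fixed discarding cost."""
    total = 0.0
    for u, h in zip(residuals, leverages):
        limit = 3 * c * math.sqrt(1.0 - h) * sigma
        total += min(u * u, limit ** 2)   # beyond the limit only the fixed cost counts
    return total

near = qmip2_loss([10.0], [0.36], c=1.0, sigma=1.0)     # outlier at distance 10
far = qmip2_loss([1000.0], [0.36], c=1.0, sigma=1.0)    # outlier at distance 1000
```

Both outliers cost $(3 \cdot 0.8)^2 = 5.76$, regardless of their distance from
the fit, which is the sense in which a discarded point no longer influences the
estimation.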

The objective functions in (3.2) and (3.3) are convex and therefore global
optimum solutions exist. The new estimators QMIP1 and QMIP2 have the bounded
influence property. Furthermore, influential points are down-weighted with the
smallest possible amount of overall pulling and this gives good robust properties
to the proposed estimators (Huber, 1981).

The solutions of the above problems, (3.2) and (3.3), are obtained via quadratic
and mixed integer programming as described in the following subsection.



3.2. Formulation of Quadratic and Mixed Integer Programming Problem 

One technique that has the potential to increase the scope for application of 
efficient statistical methodology is mathematical programming. Barrodale and 
Roberts (1970, 1973) proposed a mathematical formulation for a regression model. 
Convex-programming theory has been developed for solving restricted least-squares 
problems (Arthanari and Dodge, 1993). 

So far, mathematical programming techniques have been used in regression 
models for various reasons. Normally, linear programming is employed in the con- 
text of the absolute error criterion. 

In robust regression, Musicant and Mangasarian (2000) presented a new ap- 
proach with a convex quadratic program, in order to solve Huber regression. They 
found their approach simpler and faster. 

We use a modification of the proposed quadratic programming formulation
(Camarinopoulos and Zioutas, 2002). Thus, the minimization problem (3.2) for
defining the robust QMIP1 estimate of the parameter $\beta$ is expressed as a
quadratic mixed integer programming problem as follows,

$$\text{minimize} \sum_{i=1}^{n} \left( u_i^{*2} + 2cw_i\sigma\varepsilon_i \right)$$

subject to:

$$\begin{aligned}
x_i^T\beta_1 - x_i^T\beta_2 + u_i^* + \varepsilon_i &\ge y_i \\
x_i^T\beta_1 - x_i^T\beta_2 - u_i^* - \varepsilon_i &\le y_i \\
\varepsilon_i &\le 1000\,\delta_i \\
u_i^* &\ge \delta_i\, c\sqrt{1-h_i}\,\sigma \\
\beta_1, \beta_2, \varepsilon_i, u_i^* &\ge 0 \\
\delta_i &: \text{zero-one variable} \\
\text{for } i &= 1, 2, 3, \ldots, n
\end{aligned} \qquad (3.4)$$

where $\delta_i$ is an integer decision variable (zero-one), indicating which
points have to be pulled. The first two constraints determine the regression line
and the resulting residuals. If $\delta_i = 1$, due to the third constraint the
pulling distance $\varepsilon_i$ can take a positive value, and from the fourth
constraint the modified residual cannot be less than the specified size, since
$\delta_i \ne 0$.

The regression coefficients / 3 f = (/?n, . . . , /?ip), = (^21, • • • , /?2p), are non- 

negative vector variables. An estimation of parameter /? is obtained by solving 
(3.4) and setting /3 = /?i — /02- 

The quadratic programming problem (3.4) has a continuous and convex ob- 
jective function and the constraints form a closed convex set. Therefore, during 
the minimization process the simplex method searches the feasible solutions on 
the extreme points or corners of the convex region. The iterative method stops at 
the extreme point $(\beta_1, \beta_2, u^*, \varepsilon, \delta)$, which is the unique global minimum, since the
objective function is continuous and convex. 

A final interpretation of the values of the decision variables
$(\beta_1, \beta_2, u^*, \varepsilon, \delta)$ in the basic optimum solution
of the above problem is that the objective function is minimized and all the
constraints are satisfied. In particular, for $\delta_i = 1$:

- the corresponding point has to be modified by pulling it a distance $\varepsilon_i$,

- the corresponding residual $u_i$ is modified so that $u_i^* = u_i - \varepsilon_i = c \sqrt{1 - h_i}\, \sigma$,

- the corresponding penalty cost is $2 c w_i \sigma \varepsilon_i$.
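Why a pulled residual ends up at a fixed size can be checked numerically for a single observation. The sketch below is an illustration under stated assumptions: the regression coefficients are held fixed, the binary variable and the fourth constraint are ignored, and the function names `split_residual` and `grid_check` are introduced here. It minimizes $u^2 + 2 c w \sigma \varepsilon$ subject to $u + \varepsilon = |r|$ and $u, \varepsilon \ge 0$.

```python
import numpy as np


def split_residual(r, c, w, sigma):
    """Split |r| into a quadratic part u_star and a pulled part eps.

    Closed form of: min u**2 + 2*c*w*sigma*eps  s.t.  u + eps = |r|, u, eps >= 0.
    """
    threshold = c * w * sigma
    u_star = min(abs(r), threshold)
    eps = abs(r) - u_star
    penalty = 2.0 * threshold * eps
    return u_star, eps, penalty


def grid_check(r, c, w, sigma, steps=20001):
    """Brute-force the same one-dimensional problem on a grid over u."""
    u = np.linspace(0.0, abs(r), steps)
    cost = u**2 + 2.0 * c * w * sigma * (abs(r) - u)
    return float(u[np.argmin(cost)])
```

Setting the derivative of $u^2 + 2 c w \sigma (|r| - u)$ to zero gives $u = c w \sigma$, so any residual larger than this threshold is cut back to it and the excess is charged linearly.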

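The mathematical formulation of the QMIP2 estimator is very similar to
formula (3.4):

$$
\text{minimize} \quad \sum_{i=1}^{n} \left( u_i^{*2} + \delta_i \left( 3 c \sqrt{1 - h_i}\, \sigma \right)^2 + 2 c_1 \sigma e_i \right)
$$

subject to:

$$
\begin{aligned}
x_i^T \beta_1 - x_i^T \beta_2 + u_i^* + \varepsilon_i + e_i &\ge y_i \\
x_i^T \beta_1 - x_i^T \beta_2 - u_i^* - \varepsilon_i - e_i &\le y_i \\
\varepsilon_i &\le 10000\, \delta_i \\
\beta_1, \beta_2, \varepsilon_i, u_i^*, \delta_i &\ge 0 \\
\delta_i &: \text{zero-one variable} \\
&\text{for } i = 1, 2, 3, \ldots, n.
\end{aligned}
\tag{3.5}
$$
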
When the binary decision variable $\delta_i$ takes the value one, the third
constraint allows $\varepsilon_i$ to take a positive value. Since there is no
pulling cost, the bad outlier can be pulled as far as is required for a better
fit. In other words, the influence of a bad outlier is eliminated. For each
such decision there is a fixed penalty cost in the loss function. The size of
this fixed discarding cost could be a function of the Schweppe critical value.
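The trade-off between keeping a point and paying the fixed discarding cost can be sketched for a single observation. The sketch assumes that the regression coefficients are held fixed and that no second weight function is active; the function name `discard_decision` and the argument `fixed_cost` are hypothetical and stand for whatever fixed penalty is chosen.

```python
def discard_decision(residual, fixed_cost):
    """Return (delta, cost) for one observation with the coefficients held fixed.

    delta = 0: the point is kept and contributes its squared residual.
    delta = 1: the point is pulled onto the fit at no pulling cost and
               contributes only the fixed discarding cost.
    """
    keep_cost = residual ** 2
    if keep_cost <= fixed_cost:
        return 0, keep_cost
    return 1, fixed_cost


def total_loss(residuals, fixed_cost):
    """Sum the per-point costs under the cheaper of the two decisions."""
    return sum(discard_decision(r, fixed_cost)[1] for r in residuals)
```

Under these assumptions each point contributes the smaller of its squared residual and the fixed cost, which is why a sufficiently bad outlier no longer influences the fit.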

If desired, a second weight function can operate simultaneously in order to
down-weight smaller outliers. To this end, a new decision variable $e_i$,
representing the pulling distance of the second weight function, is inserted
into the formulation, and the corresponding pulling cost is included in the
loss function.

The quadratic programming problem still has a continuous and convex objective
function, and therefore a global optimum solution can be provided by the
simplex method.

The robust properties of the proposed estimators are illustrated numerically 
with Monte Carlo simulation, which is discussed in the following section. 



4. Simulation Design 

To evaluate the performance of the new robust approach, we carried out a
simulation study that compared it to well-known robust methods. To carry out
one simulation run, we proceeded as follows. The distributions of the
independent variables and the values of the parameters were given. Errors were
generated according to an error distribution, and the observations were
obtained following the regression model (2.1). In this study we considered
five factors:

1) number of regression outliers, 

2) number of “good” leverage points,

3) tuning constants, 

4) true model, 

5) type of estimate. 
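One simulation run of the kind described above can be sketched as follows. The sample size, the coefficient values and the counts of outliers and leverage points follow the Results section; everything else (standard normal regressors, no intercept, the shift sizes `shift` and `x_far`, and the function name `simulate_sample`) is an assumption made for illustration, since model (2.1) is not reproduced here.

```python
import numpy as np


def simulate_sample(n=25, beta=(1.20, -0.80), n_outliers=3, n_leverage=3,
                    sigma=1.0, shift=10.0, x_far=10.0, seed=0):
    """Generate one contaminated Gaussian regression sample (X, y)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    X = rng.normal(size=(n, beta.size))
    # "Good" leverage points: x far from the bulk, y still follows the model.
    X[:n_leverage] = x_far + rng.normal(size=(n_leverage, beta.size))
    y = X @ beta + sigma * rng.normal(size=n)
    # Regression outliers: responses shifted away from the true model.
    outlier_idx = np.arange(n_leverage, n_leverage + n_outliers)
    y[outlier_idx] += shift
    return X, y
```

Repeating the call with different seeds gives the replications over which the estimators are compared.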

N = 100 replications were chosen for this Monte Carlo study in order to
obtain a relative error $(\hat{\beta}_j - \beta_j)/\beta_j < 10\%$ for all
$j = 1, 2, 3, \ldots, p$, with a reasonable confidence level of at least 90%
for all the simulation estimates.

All the constants of the robust estimates are tuned so that all the resulting 
average weights are the same. Thus, for all methods we down-weight approximately
the same fraction of data. According to the robustness literature any other choice 
for the scale parameter does not change the performance of our proposed robust 
estimator. 
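Tuning a constant so that the average weight matches a target can be sketched with a simple bisection. This is an illustration only: it assumes Huber-type weights $\min(1, c/|r|)$ on already scaled residuals, and the function names `huber_weights` and `tune_constant` are hypothetical, not the procedure used in the study.

```python
import numpy as np


def huber_weights(r, c):
    """Huber-type weights min(1, c/|r|) for scaled residuals r."""
    r = np.abs(np.asarray(r, dtype=float))
    return np.minimum(1.0, c / np.maximum(r, 1e-12))


def tune_constant(r, target, tol=1e-8):
    """Find the smallest c whose average weight reaches `target` (0 < target <= 1)."""
    r = np.asarray(r, dtype=float)
    lo, hi = 1e-6, float(np.max(np.abs(r))) + 1.0  # at hi every weight equals 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if huber_weights(r, mid).mean() < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

The average weight is non-decreasing in the constant, which is what makes bisection applicable.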

5. Results 

After 100 replications, using samples of size n = 25 for a Gaussian regression
model contaminated by regression outliers and “good” leverage points, Monte
Carlo results were obtained. The three tables presented correspond to the QP,
QMIP1 and QMIP2 estimators, and the results are based on slightly different
groups of data contaminated by regression outliers and “good” leverage points.

The performance of the robust estimators is measured by Monte Carlo es- 
timates (actual calculations) of the following criteria: Mean Square Error, Mean 
Absolute Error, Norm of the Bias, and Trace of the Covariance matrix. 
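The four criteria can be computed from the replicated estimates along the following lines. The exact definitions used in the study are not given in the text, so the sketch assumes the usual ones: the mean square error is summed over the coefficients, and the covariance uses the 1/N normalization. The function name `performance` is hypothetical.

```python
import numpy as np


def performance(estimates, beta_true):
    """Monte Carlo criteria from an (N, p) array of coefficient estimates."""
    est = np.asarray(estimates, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    err = est - beta_true
    bias = err.mean(axis=0)
    cov = np.atleast_2d(np.cov(est, rowvar=False, bias=True))
    return {
        "mse": float(np.mean(np.sum(err ** 2, axis=1))),  # mean squared error
        "mae": np.mean(np.abs(err), axis=0),              # per-coefficient MAE
        "bias_norm": float(np.linalg.norm(bias)),         # Euclidean norm of bias
        "trace_cov": float(np.trace(cov)),                # trace of covariance
    }
```

With these definitions the mean square error equals the trace of the covariance matrix plus the squared norm of the bias, which gives a quick consistency check on tabulated values.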

All of the following conclusions were supported by careful examination of
the individual estimates. Taking into account the four performance criteria
mentioned above, the QP, QMIP1 and QMIP2 estimators had, as expected, the best
performance in the presence of both regression outliers and “good” leverage
points (Table 1, Table 2 and Table 3).

Table 1. Number of regression outliers 3, “good” leverage points 3.
$\beta_1 = 1.20$, $\beta_2 = -0.80$.

Est.   Var. of     Var. of     Trace of   Mean Absol.      Mean Absol.      Norm of        Mean Squar.
       $\beta_1$   $\beta_2$   Covar      Err. $\beta_1$   Err. $\beta_2$   Bias $\beta$   Fit. Error
MLL    .021        .008        .028       .184             .122             .230           277.8
KW     .029        .011        .039       .154             .104             .198           277.9
QP     .021        .008        .028       .139             .098             .180           272.8
MH     .028        .010        .038       .176             .116             .222           278.4



Table 2. Number of regression outliers 3, “good” leverage points 3.
$\beta_1 = 1.20$, $\beta_2 = -0.80$.

Estimators                       Mean of     Mean of     Trace of   Mean Square        Mean Squar.
                                 $\beta_1$   $\beta_2$   Covar      Error of $\beta$   Fit. Error
QMIP1                            1.169       -0.802      0.009      0.010              261
Schweppe-type, Mallows weights   1.148       -0.787      0.020      0.023              265


Table 3. Number of regression outliers 3, “good” leverage points 3.
$\beta_1 = 1.20$, $\beta_2 = -0.80$.

Estimators                       Mean of     Mean of     Trace of   Mean Square        Mean Squar.
                                 $\beta_1$   $\beta_2$   Covar      Error of $\beta$   Fit. Error
QMIP2                            1.145       -0.806      0.024      0.022              264
Schweppe-type, Mallows weights   1.088       -0.754      0.054      0.068              275

5.0.1. Computation Time. All experiments ran on a 1200 MHz AMD Athlon
processor. All computations of robust estimators were carried out using the
Fortran library ROBETH (1993), provided by Marazzi. A new optimization
algorithm, the solver “FortMP/QMIP” (CARISMA, Brunel University), which has
been developed for large-scale mathematical programming problems, can reduce
the computational time of the solution of the above QP-mixed integer problems,
making the new procedure reasonable and promising.

For the QP estimator the computational time is competitive with that of the
ordinary statistical procedures.

For the QMIP1 estimator the computation is slower than the standard
statistical methods. For a medium-size sample, however, the computation is
still fast, so the proposed approach is reasonable.

For the QMIP2 estimator, generally, the computation is slow. For a sample 
size up to 50 or even 100 observations the solution is obtained quickly. Suitable 
algorithms or alternative formulations have to be sought for larger problems.



6. Final Remarks 

The basic aim of this work was to develop useful mathematical programming
formulas for robust estimation. The quadratic mixed integer programming method
offers a systematic approach to improving the sample behavior of
bounded-influence type estimators. Since regression outliers are the most
dangerous and “good” leverage points often exist in the data, we recommend the
use of the QP and QMIP estimators, which provide good protection against bias
while keeping good efficiency.

Based on the above optimality criteria and results, we conclude that the
QMIP estimator works well in many circumstances and is reasonable for all
x-outliers including “good” leverage points. This study suggests that bounded
influence regression estimators can gain from the use of quadratic mixed
integer programming.

Also, the mathematical programming technique in robust estimation offers the
flexibility to

- impose constraints on the parameter vector $\beta$ when required, for
example in an outlier-multicollinearity problem (Simpson and Montgomery, 1996),

- identify the most catastrophic outliers in a sample of observations,

- combine two or more weight functions, in order to bound the “position”
weight function for points with small ft, a desirable property which has been
discussed by Huber and Mallows (Huber, 1983).

Further research is needed to determine possible better choices of the combined 
weight functions and the modified residual size. 